Everything You Need To Understand About Monitoring Systems
Monitoring the availability of our systems infrastructure is vital for anticipating incidents and lets us act more quickly when they occur. It also lets us detect security incidents that could otherwise be overlooked because of their ‘low’ impact on the performance of the affected computer. In recent years, monitoring has greatly simplified preventive system maintenance:
- It is no longer necessary to connect to each computer one by one to review, for example, its resource usage: disk space, CPU or RAM. Monitoring gives us a centralized point where the information from the different systems is available.
- We can prioritize our interventions on the machines that monitoring has flagged as needing the most attention.
- We have performance histories that make it much easier to estimate when a system's resources should be expanded or reduced. They can also make it easier to obtain figures for SLAs.
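As an illustration of how a performance history can feed a capacity estimate, here is a minimal sketch that fits a straight line to daily disk usage and extrapolates when the disk would fill. The function name `days_until_full`, the one-sample-per-day layout and the sample figures are assumptions made for this example, not part of any particular monitoring product.

```python
from typing import Optional, Sequence


def days_until_full(daily_used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate days until a disk fills, given one usage sample per day.

    Fits a least-squares straight line to the samples and extrapolates it.
    Returns None when usage is flat or shrinking (no fill date to predict).
    """
    n = len(daily_used_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used_gb) / n
    # Slope of the fitted line: growth in GB per day.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_used_gb))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Value of the fitted line at the most recent sample.
    fitted_now = intercept + slope * (n - 1)
    return max(0.0, (capacity_gb - fitted_now) / slope)


history = [400.0, 402.0, 404.0, 406.0, 408.0]  # hypothetical samples, GB used
print(days_until_full(history, capacity_gb=500.0))  # prints 46.0
```

A linear fit is only a rough first estimate. Real usage often grows in bursts, so a figure like this is best treated as a prompt to look closer rather than as a deadline.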
But we are not only talking about preventive actions. A large part of the value of monitoring also lies in detecting incidents in real time: receiving alerts (whether via email, SMS or other channels) that notify us of the unavailability of a service lets us respond rapidly to these events and makes the incidents easier to resolve.
Imagine this situation: we receive an alert that the database service of our business ERP is not responding. The systems technicians quickly get to work on resolving it, anticipating that users may be affected. If someone then reports a failure in the management application, they already have a clear idea of where the point of failure is, since the alert has informed us that the database service is not available.
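The ERP scenario can be sketched in a few lines: if we record which applications depend on which services, an alert can state both the failure and its likely impact. The dependency table, service names and function names below are hypothetical and exist only for illustration.

```python
from typing import Dict, List

# Hypothetical dependency map: application -> services it relies on.
DEPENDENCIES: Dict[str, List[str]] = {
    "erp": ["erp-database", "erp-appserver"],
    "intranet": ["web-frontend"],
    "reporting": ["erp-database"],
}


def affected_applications(failed_service: str, dependencies: Dict[str, List[str]]) -> List[str]:
    """Return the applications that depend on the failed service, sorted by name."""
    return sorted(app for app, services in dependencies.items() if failed_service in services)


def build_alert(failed_service: str, dependencies: Dict[str, List[str]]) -> str:
    """Compose a short alert text naming the failure and its likely impact."""
    impacted = affected_applications(failed_service, dependencies)
    impact = ", ".join(impacted) if impacted else "none known"
    return f"ALERT: {failed_service} is not responding. Affected applications: {impact}"


print(build_alert("erp-database", DEPENDENCIES))
```

With a message like this, a technician who later receives a user complaint about the ERP can immediately relate it to the database alert.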
And how does this monitoring help us detect security compromises in the elements of our infrastructure? Imagine that a server has been compromised and is now part of a cryptocurrency mining botnet. How can we detect it if the server is still online and providing its services in a ‘normal’ way, and we do not have an advanced security solution such as a SIEM? Monitoring will show abnormal CPU usage values, and once alerted to that abnormal performance we can act accordingly. What if our on-premises mail server is being used to send spam? Constant monitoring of the mail queue or of the traffic on the server’s network card will allow us to detect this type of compromise.
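One simple way to decide that a CPU value is ‘abnormal’ is to compare it with the server's own recent baseline. The sketch below flags a reading that sits several standard deviations above the baseline mean. The function name, the threshold of 3 and the sample values are assumptions for illustration, and real platforms may use different or more elaborate methods.

```python
from statistics import mean, stdev
from typing import Sequence


def is_cpu_anomalous(baseline: Sequence[float], current: float, threshold: float = 3.0) -> bool:
    """Flag a reading more than `threshold` standard deviations above the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    spread = stdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any increase counts as a deviation.
        return current > mean(baseline)
    return (current - mean(baseline)) / spread > threshold


# Hypothetical CPU usage percentages collected during normal operation.
normal_usage = [10.0, 12.0, 11.0, 13.0, 9.0, 11.0, 10.0, 12.0]
print(is_cpu_anomalous(normal_usage, 95.0))  # prints True
print(is_cpu_anomalous(normal_usage, 12.0))  # prints False
```

A server that suddenly starts mining cryptocurrency would produce sustained readings like the first one, even though all of its services still respond.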
Traditionally we have monitored RAM, CPU, disk occupancy and the status of services on each machine, but monitoring can be extended beyond end devices and servers to the infrastructure that supports them. Some examples of values that we can and should monitor are:
- Status of VMware virtual infrastructure: space in datastores, health status of ESXi hosts, usage values of ESXi hosts, existing snapshots in virtual machines…
- Health status of storage cabinets: failed disks and other hardware errors, status of replication between cabinets (if any), status of the arrays…
- Hardware health status of blade chassis and other server-class equipment.
- UPSs: load status, information on input/output voltages, environment values…
- PDUs: device status, input/output lines.
- Switches: hardware health status, traffic on the different interfaces, existence of loops…
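Whatever the element, most of these values end up being compared against limits. As a minimal sketch, the code below classifies a reading as ok, warning or critical. The metric names and limit values are hypothetical, and appropriate limits depend on each environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Warning and critical limits for a metric where higher values are worse."""
    warning: float
    critical: float


# Hypothetical limits, chosen only for illustration.
THRESHOLDS = {
    "datastore_used_percent": Threshold(warning=80.0, critical=90.0),
    "ups_load_percent": Threshold(warning=70.0, critical=85.0),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning' or 'critical' for a reading of a known metric."""
    limits = THRESHOLDS[metric]
    if value >= limits.critical:
        return "critical"
    if value >= limits.warning:
        return "warning"
    return "ok"


print(classify("datastore_used_percent", 92.5))  # prints critical
print(classify("ups_load_percent", 72.0))        # prints warning
```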
In addition to infrastructure monitoring, we can perform detailed monitoring of elements such as:
- Databases: MySQL (sizes, expensive queries, connections…), MS SQL Server (sizes, connections, locks…), Oracle, PostgreSQL
- Network services: Active Directory availability, DHCP, internal DNS…
- Printer infrastructure: toner levels, printer hardware failures…
- Microsoft IIS: status of pools, web applications…
The above are just some examples; the list of elements to monitor is very extensive. Let us also remember that in many cases this monitoring can be done using the SNMP protocol, which is lightweight and does not require installing any additional agent or third-party application on the equipment to be monitored. Many health checks do not even need SNMP queries and can be performed with common network protocols.
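As an example of a health check that needs neither SNMP nor an agent, the sketch below simply tests whether a TCP port accepts connections, using only the Python standard library. The function name and the default timeout are assumptions for illustration. A successful connection shows that something is listening on the port, not that the service behind it is fully healthy.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For instance, `tcp_port_open("db.example.internal", 3306)` would check a MySQL listener on a hypothetical host; the host name here is a placeholder.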
Going a step further, it is worth noting the usefulness of integrating the monitored elements into maps that give us a quick overview of the status of our systems.
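A map usually has to summarize many elements in a single colour or icon. One common convention, assumed here for illustration, is that a group shows the worst status among its members. The `Status` levels and element names below are hypothetical.

```python
from enum import IntEnum
from typing import Dict


class Status(IntEnum):
    """Severity levels ordered so that a higher value means a worse state."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def group_status(element_statuses: Dict[str, Status]) -> Status:
    """Return the worst status among the elements; an empty group counts as OK."""
    return max(element_statuses.values(), default=Status.OK)


# Hypothetical elements grouped under one node of the map.
datacenter = {
    "esxi-host-1": Status.OK,
    "core-switch": Status.WARNING,
    "ups-rack-a": Status.OK,
}
print(group_status(datacenter).name)  # prints WARNING
```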
There are several monitoring platforms on the market. Some of the best known are Nagios, Centreon, Zabbix, SolarWinds, PRTG and, of course, Microsoft’s SCOM. Some are open source and others are paid; some are more basic and others have more advanced features. The choice of one solution or another will depend on the characteristics of our organization and the functionality that we want to obtain from it.