Our beautiful API is getting better and better, and now monitoring is in place as well: trust is good, but control is better.
We have integrated various monitoring approaches to ensure that we are always up to date on how our system is doing.
With Infrastructure Monitoring, we initially cover only the most basic prerequisites and determine whether the API can function at all.
For example, we monitor the utilisation of the CPU, main memory, hard drives, and network.
With our agent integrated into the application, we register incoming requests, the corresponding responses, and processing times, and we detect exceptions. All of this information is transmitted to a monitoring tool and forms our Passive Monitoring.
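To make the idea of such an agent more tangible, here is a minimal sketch in Python: a hypothetical `instrumented` decorator that records duration, outcome, and exceptions per endpoint into an in-memory store standing in for the monitoring backend. Real agents (APM libraries) hook into the web framework instead, but the principle is the same.

```python
import time
from collections import defaultdict

# Hypothetical in-memory store standing in for the monitoring backend.
metrics = defaultdict(list)

def instrumented(endpoint):
    """Record duration, outcome, and exceptions for each handler call."""
    def decorator(handler):
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                response = handler(*args, **kwargs)
                status = "success"
                return response
            except Exception:
                status = "exception"
                raise
            finally:
                duration_ms = (time.perf_counter() - start) * 1000
                metrics[endpoint].append(
                    {"status": status, "duration_ms": round(duration_ms, 2)}
                )
        return wrapper
    return decorator

@instrumented("GET /orders")
def get_orders():
    return ["order-1", "order-2"]

get_orders()
print(metrics["GET /orders"])
```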
However, even this type of monitoring reaches its limits if you imagine that no requests are made to our API for a while.
For this reason, we use our Synthetic Monitoring to actually call the various API endpoints, as we assume or expect them to be called by our consumers.
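The idea behind Synthetic Monitoring can be sketched in a few lines of Python: a small probe that calls assumed example endpoints the way a consumer would and records status and latency. In practice, the scheduling and the endpoint list would come from the monitoring tool's synthetic-check feature or a cron job.

```python
import time
import urllib.error
import urllib.request

# Hypothetical endpoints we expect our consumers to call.
ENDPOINTS = [
    "https://api.example.com/health",
    "https://api.example.com/orders",
]

def probe(url, timeout=5):
    """Call one endpoint the way a consumer would and record the outcome."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.URLError as err:
        status = getattr(err, "code", None) or "unreachable"
    latency_ms = (time.perf_counter() - start) * 1000
    return {"url": url, "status": status, "latency_ms": round(latency_ms, 1)}

if __name__ == "__main__":
    # In practice this runs on a schedule rather than once.
    for url in ENDPOINTS:
        print(probe(url))
```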
All of the monitoring approaches above rely on metrics, which can then be further processed in tools such as Prometheus or Datadog.
But what are metrics and which metrics are useful?
Metrics capture a value that relates to our systems at a specific point in time – for example, the number of users currently logged into a web application. Therefore, metrics are usually collected once per second, once per minute or at some other regular interval to monitor a system over time.
There are two important categories of metrics: work metrics and resource metrics. For each system that is part of our software infrastructure, we find out which work and resource metrics are available and collect them.
Work Metrics
Work metrics show the state of our system at the highest level. When looking at work metrics, it is often helpful to divide them into four subtypes:
- Throughput is the amount of work the system performs per unit of time. Throughput is usually recorded as an absolute number. For a web server, for example, this is the number of requests per second.
- Success Metrics represent, for example, the percentage of successfully processed requests for a web server, i.e. the responses with HTTP 2xx.
- Error Metrics capture the number of erroneous results, usually expressed as error rate per unit of time or normalised by throughput to count errors per unit of work. Error metrics are often tracked separately from success metrics when there are multiple potential sources of error, some of which are more severe than others. In our web server example, this is the percentage of responses with HTTP 5xx.
- Performance Metrics quantify how efficiently a component is doing its job. The most common performance metric is latency, i.e. the time it takes to complete a unit of work. Latency can be expressed as an average or as a percentile, e.g. “99 per cent of requests are returned within 0.1 seconds”.
These metrics are incredibly important for monitoring. They make it possible to quickly answer the most pressing questions about the internal state and performance of a system: is the system available and actively doing what it was built to do? How is the performance? What is the system’s error rate?
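As a small illustration of the four subtypes, the following Python sketch derives throughput, success rate, error rate, and a rough p99 latency from a handful of made-up request records:

```python
import math

# Made-up request records for one 60-second collection interval.
requests = [
    {"status": 200, "latency_ms": 45},
    {"status": 200, "latency_ms": 80},
    {"status": 503, "latency_ms": 1200},
    {"status": 200, "latency_ms": 60},
]
window_seconds = 60

throughput = len(requests) / window_seconds                                   # work per second
success_rate = sum(200 <= r["status"] < 300 for r in requests) / len(requests) * 100  # HTTP 2xx share
error_rate = sum(r["status"] >= 500 for r in requests) / len(requests) * 100          # HTTP 5xx share

# Nearest-rank p99 latency as a simple performance metric.
latencies = sorted(r["latency_ms"] for r in requests)
p99 = latencies[math.ceil(0.99 * len(latencies)) - 1]

print(f"{throughput:.2f} req/s, {success_rate:.0f}% 2xx, {error_rate:.0f}% 5xx, p99={p99} ms")
```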
Resource Metrics
Most components of our software infrastructure serve as resources for other systems. Some resources are low-level resources – for example, the resources of a server include physical components such as CPU, memory, hard drives, and network interfaces. But a higher-level component, such as a database, can also be considered a resource if another system requires that component.
Resource metrics can help us reconstruct a detailed picture of system health, making them particularly valuable for investigating and diagnosing problems. We collect metrics for each resource in our system that cover four key areas:
- Utilisation is the percentage of time the resource is busy or the percentage of the resource’s capacity that is being used.
- Saturation is a measure of the amount of work the resource cannot yet serve, for example the number of tasks still waiting in a queue.
- Errors represent internal errors that may not be observable in the work the resource produces.
- Availability represents the percentage of time the resource has responded to requests. This metric is only clearly defined for resources that can be actively and regularly checked for availability.
To illustrate this, here is an overview of the metrics for various resources:
Resource | Utilisation | Saturation | Errors | Availability
--- | --- | --- | --- | ---
Microservice | Average percentage of time the service was busy | Requests not yet processed | Internal errors of the microservice | Percentage of time the microservice was available
Database | Average percentage of time each database connection was busy | Unprocessed requests | Internal errors, e.g. replication errors | Percentage of time the database was accessible
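For the low-level resources of a single host, such metrics can be sampled directly. The following sketch uses the third-party psutil library (and os.getloadavg, which is Unix-only) purely as an illustration; higher-level resources such as databases or microservices usually expose their own counters instead.

```python
import os

import psutil  # third-party: pip install psutil

# Utilisation: how busy the host's low-level resources are right now.
utilisation = {
    "cpu_percent": psutil.cpu_percent(interval=1),   # busy share over 1 second
    "memory_percent": psutil.virtual_memory().percent,
    "disk_percent": psutil.disk_usage("/").percent,
}

# Load average relative to core count as a rough saturation signal:
# values above 1.0 mean work is queueing up in front of the CPUs.
load_1min, _, _ = os.getloadavg()
saturation = load_1min / os.cpu_count()

print(utilisation, round(saturation, 2))
```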
Metric Types
In addition to the categories of metrics, the different types of metrics are also important. The metric type influences how the corresponding metrics are displayed in a tool such as Datadog or Prometheus.
The following different types exist:
- Count
- Rate
- Gauge
- Histogram
The Count metric type represents the number of events during a specific time interval. It is used, for example, to record the total number of connections to a database or the number of requests to an endpoint.

The Count type differs from the Rate metric type, which records the number of events per second within a defined time interval. The Rate type can be used to record how often things happen, such as the frequency of connections to a database or of requests to an endpoint.

The Gauge metric type returns a snapshot value for a specific time interval; the last value within the interval is reported. A gauge is primarily used to measure values continuously, e.g. the available hard drive space.

The Histogram metric type can be used to determine the statistical distribution of measured values during a defined time interval, e.g. their average, count, median, or maximum.
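As a hedged example of how these types look in code, here is a minimal sketch using the Python prometheus_client library; the metric and label names are invented for illustration:

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("api_requests_total", "Total requests per endpoint", ["endpoint"])
DISK_FREE = Gauge("disk_free_bytes", "Free disk space at measurement time")
LATENCY = Histogram("request_latency_seconds", "Request latency distribution")

start_http_server(8000)  # exposes /metrics for Prometheus to scrape

for _ in range(60):
    REQUESTS.labels(endpoint="/orders").inc()   # Count: monotonically increasing
    DISK_FREE.set(random.uniform(1e9, 2e9))     # Gauge: snapshot of the last value
    LATENCY.observe(random.uniform(0.01, 0.5))  # Histogram: one observed value
    time.sleep(1)

# A Rate is typically not recorded directly: Prometheus derives it from the
# counter at query time, e.g. rate(api_requests_total[5m]).
```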
Events
In addition to metrics that are recorded more or less continuously, some monitoring systems can also record events: discrete, infrequent occurrences that can play a crucial role in understanding the behavioural changes of our system. Some examples:
- Changes: internal code releases, builds and build errors
- Warnings: internally generated warnings or notifications from third-party providers
- Scaling resources: adding or removing hosts
Unlike a single metric data point, which is generally only meaningful in context, an event usually contains enough information to be interpreted on its own. Events capture what happened at a particular point in time, with optional additional information.
Events are sometimes used to generate warnings – someone should be notified of events that indicate that critical work has failed. But more often they are used to investigate problems and correlate across systems. Events should be treated like metrics – they are valuable data that needs to be collected wherever possible.
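What such an event record could look like is sketched below; the field names are our own choice, not the API of any particular monitoring tool, and the sample values are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MonitoringEvent:
    """A discrete occurrence with enough context to be interpreted on its own."""
    title: str
    text: str
    tags: list[str] = field(default_factory=list)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

deploy = MonitoringEvent(
    title="Release deployed",
    text="New build rolled out to production by the CI pipeline.",
    tags=["deployment", "api", "production"],
)
print(deploy)
```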
But what should good data look like?
The data collected should have four characteristics:
- Good comprehensibility: We should be able to quickly determine how each metric or event was captured and what it represents. During an outage, we do not want to spend time trying to figure out what our data means. We keep metrics and events as simple as possible and name them clearly.
- Appropriate time interval: We need to collect our metrics at appropriate intervals so that any problems become visible. If we collect metrics too infrequently or average values over long time windows, we may lose the ability to accurately reconstruct the behaviour of a system. For example, periods with a resource utilisation of 100 per cent will be obscured if they are averaged with periods of lower utilisation. On the other hand, it must also be kept in mind that collecting metrics, and synthetic monitoring in particular, creates a certain (continuous) load: a) it “takes away” resources from the “actual” work, and b) it overlays its data, since synthetic monitoring naturally shows up in passive monitoring as well. If the chosen time interval is too short, this can lead to the system being noticeably overloaded or to the data not being meaningful.
- Scope reference: Suppose each of our services operates in multiple regions and we can check the overall health of each region or their combinations. It is then important to be able to allocate metrics to the appropriate regions so that we can alert on problems in the relevant region and investigate outages quickly (see the sketch after this list).
- Sufficiently long data retention: If we discard data too early, or our monitoring system aggregates our metrics after some time to reduce storage costs, we lose important information about the past. Keeping our raw data for a year or more makes it much easier to know what is “normal”, especially if our metrics show monthly, seasonal, or annual fluctuations.
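To make the scope reference (and the clear naming) concrete, here is a small sketch using the prometheus_client library again; the metric and label names are invented:

```python
from prometheus_client import Gauge

LOGGED_IN_USERS = Gauge(
    "webapp_logged_in_users",     # clearly named: easy to understand later
    "Users currently logged in",
    ["region"],                   # scope reference: alert and filter per region
)

LOGGED_IN_USERS.labels(region="eu-central").set(1423)
LOGGED_IN_USERS.labels(region="us-east").set(2210)
```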
With our collected metrics, we are now able to provide information about the health of our system at any given time.
BUT: Who wants to spend all day looking at the charts generated and checking the status of the system? There should be something that lets us know when unusual things happen.
→ An alerting system
Drawing attention to the essentials
Automated alerts are essential for monitoring. They allow us to detect problems anywhere in our infrastructure so that we can quickly identify their causes and minimise service disruptions and interruptions.
While metrics and other measurements facilitate monitoring, alerts draw attention to the specific systems that require observation, inspection and intervention.
But alerts are not always as effective as they could be. In particular, real problems often get lost in a sea of messages. Here we want to describe a simple approach to effective alerting:
- Alert generously, but judiciously.
- Alert on symptoms rather than causes.
When should you alert someone (or no one)?
An alert should communicate something specific about our systems in plain language: “Two Cassandra nodes are down” or “90 per cent of all web requests take more than 0.5 seconds to process and respond”. By automating alerts for as many of our systems as possible, we can respond quickly to problems and provide a better service. It also saves us time by freeing us from the constant, manual checking of metrics.
Levels of Alert Urgency
Not all alerts have the same urgency. Some require immediate intervention, some require eventual intervention and some indicate areas that may require attention in the future. All alerts should be logged in at least one centralised location to allow easy correlation with other metrics and events.
Alerts as Records (Low Severity)
Many alerts are not associated with a service issue, so a human may not even need to look at them. For example, if a service responds to user requests much more slowly than usual, but not so slowly that the end user finds it annoying, this should generate a low-severity alert. It is recorded in monitoring and stored for future reference or investigation. After all, temporary problems that could be responsible, such as network congestion, often disappear on their own. However, should the service return a large number of timeouts, this information provides an invaluable basis for our investigation.
Warnings as Notifications (Medium Severity)
The next level of alerting urgency concerns problems that require intervention but not immediately. The data storage space may be running low and should be scaled up in the next few days. Sending an email and/or posting a notification in a dedicated chat channel is a perfect way to deliver these alerts – both message types are highly visible but they don’t wake anyone up in the middle of the night and don’t disrupt our workflow.
Warnings as Alerts (High Severity)
The most urgent alerts should be given special treatment and escalated immediately to get our attention quickly. For example, response times for our web application should have an internal SLA that is at least as aggressive as our most stringent customer-facing SLA. Any instance of response times exceeding our internal SLA requires immediate attention, regardless of the time of day.
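One way to wire up these three levels is sketched below; the channel functions are hypothetical placeholders for whatever email, chat, or paging integrations are actually in place.

```python
from enum import Enum

class Severity(Enum):
    LOW = "record"      # store only, look at it when investigating
    MEDIUM = "notify"   # email / chat channel, handled during working hours
    HIGH = "page"       # wake someone up

def route_alert(message: str, severity: Severity) -> None:
    log_to_monitoring(message)          # every alert is recorded centrally
    if severity is Severity.MEDIUM:
        post_to_chat_channel(message)
    elif severity is Severity.HIGH:
        page_on_call_engineer(message)

# Placeholder integrations, printed here only to keep the example runnable.
def log_to_monitoring(msg): print(f"[record] {msg}")
def post_to_chat_channel(msg): print(f"[chat]   {msg}")
def page_on_call_engineer(msg): print(f"[page]   {msg}")

route_alert("p99 latency above internal SLA for 5 minutes", Severity.HIGH)
```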
When should you wake a sleeping engineer? 😉
When we consider setting up an alert, we ask ourselves three questions to determine its urgency and how it should be handled:
- Is this problem real? It may seem obvious, but if the problem is not real, it should not normally generate an alert. The following examples may trigger alerts but are probably not symptomatic of real problems. Alerting on events like these contributes to alert fatigue and may cause more serious problems to be ignored:
  - Metrics in a test environment are outside established limits.
  - A single server is performing very slowly, but it is part of a cluster with fast failover to other machines and is rebooting regularly anyway.
  - Planned upgrades result in a large number of machines being reported as offline.
  If the problem actually exists, an alert should be generated. Even if the alert is not linked to a notification (via email or chat), it should be recorded in our monitoring system for later analysis and correlation.
- Does this problem require attention? There are very real reasons to call someone away from work, sleep, or their private time, but we should only do so when there really is no other way. In other words, if we can adequately automate a response to a problem, we should consider doing so. If the problem is real and requires (human) attention, an alert should be generated to notify someone who can investigate and fix it. Depending on the severity of the problem, it may wait until the next morning. We can therefore distinguish between different means of notification: at a minimum, the notification should be sent by email, chat, or a ticketing system so that the recipient can prioritise their response. Beyond that, calls, push notifications to mobile phones, etc. are also conceivable in order to draw attention to a problem as quickly as possible.
- Is this problem urgent? Not all problems are emergencies. For example, perhaps a moderately high percentage of system responses were very slow, or a slightly higher percentage of queries are returning stale data. Both problems may need to be fixed soon, but not at 4.00 a.m. If, on the other hand, the performance of a key system drops or it stops working, we should check immediately. If the symptom is real, requires attention, and is acute, an urgent alert should be generated.
Fortunately, monitoring solutions such as Prometheus or Datadog offer the option of connecting various communication channels such as email, Slack (chat) or even SMS.
Symptom Alert
In general, an alert is the most appropriate type of warning when the system we are responsible for can no longer process requests with acceptable throughput, latency or error rates. This is the kind of problem we want to know about immediately. The fact that our system is no longer doing useful work is a symptom – that is, it is the manifestation of a problem that can have a number of different causes.
For example: if our website has been responding very slowly for the last three minutes, this is a symptom. Possible causes include high database latency, down application servers, high load, etc. Wherever possible, we base our alerting on symptoms rather than causes.
Alerting on symptoms surfaces real, often user-facing problems rather than hypothetical or internal ones. Let’s compare alerting on a symptom, such as slow website responses, with alerting on possible causes of the symptom, such as high utilisation of our web servers:
Our users won’t know or care about the server load if the website is still responding quickly and we’ll be annoyed if we have to take care of something that is only noticeable internally and can return to normal levels without intervention.
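A symptom-based alert condition can be as simple as the following sketch, which checks the share of slow responses in the last window against an assumed internal SLA, regardless of what is causing the slowness:

```python
SLA_SECONDS = 0.5        # assumed internal SLA for response times
MAX_SLOW_SHARE = 0.10    # alert if more than 10% of requests are slower than the SLA

def symptom_breached(latencies_seconds: list[float]) -> bool:
    """True if the share of slow responses exceeds the allowed share."""
    if not latencies_seconds:
        return False
    slow = sum(1 for lat in latencies_seconds if lat > SLA_SECONDS)
    return slow / len(latencies_seconds) > MAX_SLOW_SHARE

# Made-up request latencies from the last few minutes:
recent = [0.12, 0.65, 0.70, 0.08, 0.91, 0.10, 0.55]
print(symptom_breached(recent))   # True -> alert, whatever the cause may be
```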
Durable Alert Definitions
Another good reason to refer to symptoms is that alerts triggered by symptoms tend to be persistent. This means that regardless of how the underlying system architectures may change, even without updating our alert definitions, we will receive a corresponding message if the system no longer functions as it should.
Exception to the Rule: Early Warning Signs
It is sometimes necessary to focus our attention on a small handful of metrics, even if the system is functioning appropriately. Early warning values reflect an unacceptably high probability that serious symptoms will soon develop and require immediate intervention.
Hard drive space is a classic example. Unlike a lack of free memory or CPU, the system is unlikely to recover once we run out of hard drive space, and we certainly have little time before our system hard-stops. Of course, if we can notify someone with enough lead time, no one has to be woken up in the middle of the night. Better yet, we can anticipate some situations where space is running low and build in an automated fix based on the data we can afford to delete, such as logs or data that also exists elsewhere.
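A possible early-warning check for this case, sketched here with made-up numbers, extrapolates recent disk usage linearly and warns if the volume is projected to fill up within a day:

```python
def hours_until_full(samples_gb, capacity_gb, interval_hours=1.0):
    """Linear extrapolation of used space; samples_gb is oldest first."""
    growth_per_hour = (samples_gb[-1] - samples_gb[0]) / (
        (len(samples_gb) - 1) * interval_hours
    )
    if growth_per_hour <= 0:
        return float("inf")  # not growing, nothing to warn about
    return (capacity_gb - samples_gb[-1]) / growth_per_hour

# Made-up measurements: used space over the last 6 hours on a 500 GB volume.
used = [430, 441, 450, 462, 471, 480]
remaining = hours_until_full(used, capacity_gb=500)
if remaining < 24:
    print(f"Early warning: disk projected to be full in {remaining:.1f} h")
```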
Conclusion: Take Symptoms Seriously
- We only send an alert if symptoms of serious problems are detected in our system or if a critical and finite resource limit (e.g. hard drive space) is about to be reached.
- We set up our monitoring system so that it records alerts as soon as it recognises that real problems have occurred in our infrastructure, even if these problems have not yet affected overall performance.
Process chain monitoring
Hooray, we get an alert: “Our storage system has failed”.
This is still no cause for celebration, as the boss is at the door and wants to know which business functions are affected. With our current setup, it is not possible to provide this information immediately.
Let’s take a closer look at this with an example: A hypothetical company operates internal and external services. The external ones consume internal services. Both types of services have dependencies on resources, such as our storage system. With a dependency tree, it is possible to model these dependencies. If an element in our tree fails, we can automatically determine which dependent systems are affected. This enables advance information to be sent to the helpdesk, first-level support, etc. and thus takes pressure and stress off the team working on the problem.
By using weighted nodes, it is possible to calculate the criticality. This in turn makes it possible to prioritise work on individual outages.
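A dependency tree with weighted nodes can be sketched in a few lines; the service names, dependencies, and weights below are invented:

```python
# Which services depend (directly) on which other services or resources.
DEPENDS_ON = {
    "webshop (external)": ["order service", "search service"],
    "order service": ["storage system", "payment gateway"],
    "search service": ["search index"],
}
# Business criticality weight per service.
WEIGHT = {"webshop (external)": 10, "order service": 5, "search service": 3}

def affected_by(failed: str) -> set[str]:
    """All services whose (transitive) dependency chain contains the failed element."""
    def depends(service):
        deps = DEPENDS_ON.get(service, [])
        return failed in deps or any(depends(d) for d in deps)
    return {service for service in DEPENDS_ON if depends(service)}

impacted = affected_by("storage system")
criticality = sum(WEIGHT.get(s, 1) for s in impacted)
print(impacted, criticality)   # e.g. {'webshop (external)', 'order service'} 15
```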
Fortunately, parallel failures only occur in theory and certainly not at weekends or on public holidays. 😉
It should be noted here that the IT landscape in companies is “alive”. The consequence of this is that anyone who uses this kind of end-to-end monitoring must also keep the model/tool up to date. This always means effort, which is often only worthwhile for critical company processes. For example, if a bot that publishes the current canteen menu in a chat tool fails, it is not worth rolling out end-to-end monitoring for it.
Conclusion
Monitoring enables us to make statements about the status of our systems at any time. Alerting draws attention to system anomalies. Process chain monitoring allows us to make a statement about the effects of those anomalies.
The prerequisite for a sensible and successful monitoring and alerting system is an overall concept, which must be developed in advance and constantly revised.