Before we start, let's first think about what system reliability means. In simple words, it is the probability of a product performing its intended function under stated conditions, without failure, for a given period of time. Achieving it requires, among other things, continuous monitoring of the state of the system. Why is this so important, and how do we make a system reliable? Today, we will try to answer these two questions by defining ‘golden metrics’ that help track basic service level objectives (SLOs) in a distributed IT system.
Based on the Google SRE workbook, for a distributed system we can define the following golden rules:
- Availability. The amount of time, or the percentage of time, that a system is able to fulfill its intended function.
- Latency. The time it takes to service a request. Frequently, latency is a synonym for delay.
- Traffic. A measure of how much demand is being placed on your system. For a web service, this measurement is usually HTTP requests per second.
- Success/Fail rate. The rate of requests that succeed or fail.
- Saturation. How “full” your service is.
Measuring these characteristics of a system will provide us with all the necessary information about its reliability; the small sketch below shows what an availability target means in practice as an allowed downtime budget.
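Here is a minimal sketch (my own illustration, not from the SRE workbook) of that calculation: it converts an availability SLO, expressed as a percentage of time, into the downtime it allows over a measurement window.

```python
# A minimal sketch: converting an availability SLO (a percentage of time)
# into the downtime budget it allows over a measurement window.

def downtime_budget_hours(slo_percent: float, window_days: float) -> float:
    """Maximum allowed downtime, in hours, for a given SLO and window."""
    return (1 - slo_percent / 100) * window_days * 24

# 99.9% over a year allows roughly 8.76 hours of downtime;
# 99.5% over 30 days allows roughly 3.6 hours.
print(f"{downtime_budget_hours(99.9, 365):.2f} h per year")
print(f"{downtime_budget_hours(99.5, 30):.2f} h per 30 days")
```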
But not all systems are the same. Before we define the ‘golden metrics’, let's describe the types of components against which we will construct our rules.
Three common types of components are:
- Request-driven. The user creates some type of event and expects a response. For example, this could be an HTTP service where the user interacts with a browser or an API for a mobile application.
- Pipeline or background process. A system that takes records as input, mutates them, and places the output somewhere else.
- Storage. A system that accepts data and makes it available to be retrieved later.
Now that we know the basics, it is time to describe the ‘golden metrics’ based on the type of IT system:
| Type | Example | Golden metrics |
| --- | --- | --- |
| Request-driven | User-facing web application, SaaS/PaaS | Availability [% time], Latency [ms], Traffic [req/s], Success rate [%], Saturation [%], Correctness [%], Quality |
| Request-driven | API, Serverless function | Availability [% time], Latency [ms], Traffic [req/s], Success rate [%], Throughput [% of bandwidth] |
| Storage | Database, Blob storage | Availability [% time], Latency [ms], Durability, Saturation [%], Deadlocks [count/min], Read/write rate |
| Request-driven | Mobile application | Availability [% time], Latency [ms], Performance, Responsiveness, Coverage, Volume |
| Pipeline | CI/CD pipeline | Pass rate [%], Duration [sec], Test pass rate [%], Code coverage [%] |
| Pipeline | Backend process | Availability [% time], Latency [ms], Correctness [%], Throughput [% of bandwidth] |
Request-driven metrics
It is very important to measure these metrics in the correct way, so I would like to start by defining the correct service level indicators (SLIs).
Each system is unique from an architectural point of view, and you have to adjust the rules based on it. Below is an example of how to define and implement metrics based on the defined SLOs:
| SLI | Unit | Calculation | Interval | Aggregation | SLO | Measurement |
| --- | --- | --- | --- | --- | --- | --- |
| Availability (success rate) | % req | count(req == ‘2XX’) / total(req) | | Average [7d, 30d] | 99.5% | The proportion of requests that resulted in a successful response: HTTP 2XX / total requests |
| Availability | % time | sum(uptime) / sum(total time) | 5 min | Average [7d, 30d] | 99.5% | How to measure uptime? PaaS metrics or a custom health check |
| Latency | % req | count(req == ‘2XX’ && RPC < SLO bucket 1) / total(req) | 5 min | Average [7d, 30d] | 90% | The proportion of requests faster than some threshold: SLO bucket 1 = 100 ms |
| Latency | % req | count(req == ‘2XX’ && RPC < SLO bucket 2) / total(req) | 5 min | Average [7d, 30d] | 95% | The proportion of requests faster than some threshold: SLO bucket 2 = 300 ms |
| Latency | % req | count(req == ‘2XX’ && RPC < SLO bucket 3) / total(req) | 5 min | Average [7d, 30d] | 99% | The proportion of requests faster than some threshold: SLO bucket 3 = 1 s |
| Traffic | req per 7d/30d | total(req) | 7d, 30d | | | Total requests per week/month |
| Traffic | req/sec | count(req) / period | | Average, Max [1 sec, 1 min, 1 h] | | Define peak capacity periods: start/end of week, end of month |
| Saturation, CPU | % | current usage / max amount | 5 min | Average, Max [7d, 30d] | 95% | CPU usage; many systems degrade in performance before they reach 100% utilization |
| Saturation, RAM | % | current usage / max amount | 5 min | Average, Max [7d, 30d] | 95% | RAM usage; many systems degrade in performance before they reach 100% utilization |
| Fail rate | % req | count(req == ‘5XX’) / total(req) | 5 min | Average [7d, 30d] | 0.5% | HTTP 5XX / total requests |
| Quality | % req | count(req == ‘2XX’) / total(req in undegraded state) | 5 min | Average [7d, 30d] | 0.5% | When the service or its backends are unavailable, measure the proportion of responses served in an undegraded state; degrading gracefully gives a better user experience |
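To illustrate the formulas in the table, here is a minimal sketch that computes the success-rate, fail-rate, and latency-bucket SLIs from a list of raw request records. The record format is an assumption for illustration; the 100 ms / 300 ms / 1 s thresholds follow the SLO buckets above.

```python
# A minimal sketch: computing request-driven SLIs from raw request records.
# Each record is (http_status, latency_ms); the thresholds follow the
# SLO buckets in the table above (100 ms, 300 ms, 1 s).

from typing import Iterable, Tuple

LATENCY_BUCKETS_MS = (100, 300, 1000)

def request_slis(requests: Iterable[Tuple[int, float]]) -> dict:
    reqs = list(requests)
    total = len(reqs)
    if total == 0:
        return {}
    ok = [(status, latency) for status, latency in reqs if 200 <= status < 300]
    slis = {
        "success_rate": len(ok) / total,                           # Availability (success rate)
        "fail_rate": sum(1 for s, _ in reqs if s >= 500) / total,  # Fail rate (5XX)
    }
    for bucket in LATENCY_BUCKETS_MS:                              # Latency SLI per bucket
        slis[f"latency_under_{bucket}ms"] = sum(1 for _, lat in ok if lat < bucket) / total
    return slis

# Example: two fast 2XX responses, one slow 2XX response, one 5XX error.
print(request_slis([(200, 42), (200, 87), (200, 450), (500, 12)]))
```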
Some of these SLIs may overlap: a request-driven service may have a correctness SLI, a pipeline may have an availability SLI, and durability SLIs might be viewed as a variant on correctness SLIs. I recommend choosing a small number (five or fewer) of SLI types that represent the most critical functionality to your customers.
Storage
These kinds of workloads are characterized by durability and availability in most cases; durability is the proportion of records written that can be successfully read.
Different aspects of a system should be measured with different levels of granularity. For example:
- Observing CPU load over the time span of a minute won’t reveal even quite long-lived spikes that drive high tail latencies (see the sampling sketch after this list).
- On the other hand, for a web service targeting no more than 9 hours aggregate downtime per year (99.9% annual uptime), probing for a 200 (success) status more than once or twice a minute is probably unnecessarily frequent.
- Similarly, checking hard drive fullness for a service targeting 99.9% availability more than once every 1–2 minutes is probably unnecessary.
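One way to keep this fine-grained signal without storing every sample, hinted at by the first point, is to collect per-second measurements but aggregate them into histogram buckets before shipping them to the monitoring system. This is a rough sketch with simulated data, not a prescription for any particular tool.

```python
# A rough sketch: sample CPU utilization every second, but keep only a
# histogram of 5%-wide buckets, so short spikes remain visible even though
# far less data is sent to the monitoring system.

import random
from collections import Counter

def bucket(cpu_percent: float, width: int = 5) -> int:
    """Lower bound of the 5%-wide bucket this sample falls into."""
    return min(int(cpu_percent // width) * width, 95)

# Simulated per-second samples for one minute: mostly idle, with a 5-second spike.
samples = [random.uniform(10, 30) for _ in range(55)] + \
          [random.uniform(90, 100) for _ in range(5)]

histogram = Counter(bucket(s) for s in samples)

# The per-minute average hides the spike; the histogram does not.
print("one-minute average:", round(sum(samples) / len(samples), 1))
print("seconds spent at >= 90% CPU:", sum(c for b, c in histogram.items() if b >= 90))
```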
| SLI | Unit | Calculation | Time period | Aggregation | SLO | Measurement |
| --- | --- | --- | --- | --- | --- | --- |
| Availability | % | uptime / total time | | Average [7d, 30d] | 99% | |
| Latency | | RPC < 100 ms | | Average [7d, 30d] | 100 ms | |
| Durability | % | | | | | The proportion of records written that can be successfully read |
| Saturation, CPU | % | current usage / max amount | 5 min, 60 min, 1 day | Average, Max | 95% | |
| Saturation, RAM | % | current usage / max amount | 5 min, 60 min, 1 day | Average, Max | 95% | |
| Saturation, I/O | % DTU | current usage / max amount | 5 min, 60 min, 1 day | Average, Max | 95% | |
| Saturation, free space | % | current usage / max amount | 5 min, 60 min, 1 day | Average, Max | 95% | |
| Throughput | % | Traffic, Kbps / Bandwidth, Kbps | 5 min | | 95% | |
| Deadlocks | number/time | number of deadlocks per hour | 5 min, 1 h | | | Number of deadlocks in a period of time; number of deadlocks per 1000 requests |
| Read | number | | 1 sec | Average, Max | | |
| Write | number | | 1 sec | Average, Max | | |
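The saturation and deadlock rows above are simple ratios; here is a minimal sketch of those calculations. All of the input numbers are illustrative placeholders, not measurements from a real system.

```python
# A minimal sketch of the storage saturation and deadlock SLIs from the table.
# All input values are illustrative placeholders.

def saturation_percent(current_usage: float, max_amount: float) -> float:
    """Saturation: current usage as a percentage of maximum capacity."""
    return 100 * current_usage / max_amount

def deadlock_rates(deadlocks: int, hours: float, total_requests: int) -> dict:
    """Deadlocks per hour and per 1000 requests, as listed in the table."""
    return {
        "per_hour": deadlocks / hours,
        "per_1000_requests": 1000 * deadlocks / total_requests,
    }

print(f"CPU saturation:  {saturation_percent(3.1, 4.0):.1f} %")   # 3.1 of 4 vCPUs used
print(f"Disk saturation: {saturation_percent(410, 512):.1f} %")   # 410 of 512 GB used
print(deadlock_rates(deadlocks=6, hours=24, total_requests=1_200_000))
```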
Pipeline
This might be a simple process that runs on a single instance in real time, or a batch process that takes many hours. Examples include:
- A system that periodically reads data from a relational database and writes it into a distributed hash table for optimized serving
- A video processing service that converts video from one format to another
- A system that reads in log files from many sources to generate reports
- A monitoring system that pulls metrics from remote servers and generates time series and alerts
A good example of a pipeline process is a CI/CD pipeline.
| SLI | Unit | Calculation | Time period | Aggregation | Measurement |
| --- | --- | --- | --- | --- | --- |
| Pass rate | % | number of successful runs / total number of runs | 30d | | The CI/CD tool commonly provides such metrics out of the box |
| Duration | time | time to run | 30d | | The CI/CD tool commonly provides such metrics out of the box |
| Test pass rate | % | number of tests passed / total number of tests | 30d | | The CI/CD tool commonly provides such metrics out of the box |
| Code coverage | % | code coverage | 30d | | The CI/CD tool commonly provides such metrics out of the box |
| Freshness | | count(records updated in last 10 days) / total(records) | | | The proportion of the data that was updated more recently than some time threshold. Ideally this metric counts how many times a user accessed the data, so that it most accurately reflects the user experience |
| Correctness | | Inject data with known outputs into the system and count the proportion of times the output matches expectations | | | The proportion of records coming into the pipeline that resulted in the correct value coming out |
| Correctness | | Use a method of calculating the correct output that is distinct from the pipeline itself, applied to known-good input | | | The proportion of records coming into the pipeline that resulted in the correct value coming out |
| Coverage (general) | | Export the number of records the pipeline should have processed and the number it successfully processed; this metric may miss records the pipeline did not know about due to misconfiguration | | | For batch processing, the proportion of jobs that processed above some target amount of data. For streaming processing, the proportion of incoming records that were successfully processed within some time window |
| Throughput | % | Traffic, Kbps / Bandwidth, Kbps | 1 min | Average, Max | |
| Volume | | number of records processed | 1 min | Average, Max | |
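As a final illustration, here is a minimal sketch of the freshness and coverage SLIs. It assumes each record carries a last-updated timestamp and that the pipeline exports counters for the records it should have processed and the records it actually processed; both assumptions are for illustration only.

```python
# A minimal sketch of the pipeline freshness and coverage SLIs from the table.
# Timestamps and counter values are illustrative assumptions.

from datetime import datetime, timedelta, timezone

def freshness(last_updated: list, threshold: timedelta = timedelta(days=10)) -> float:
    """Proportion of records updated more recently than the threshold."""
    now = datetime.now(timezone.utc)
    return sum(1 for ts in last_updated if now - ts < threshold) / len(last_updated)

def coverage(records_processed: int, records_expected: int) -> float:
    """Proportion of expected records the pipeline successfully processed.
    Note: this misses records the pipeline never knew about (misconfiguration)."""
    return records_processed / records_expected

now = datetime.now(timezone.utc)
timestamps = [now - timedelta(days=d) for d in (1, 3, 12, 30)]
print(f"freshness: {freshness(timestamps):.2f}")    # 2 of 4 records updated within 10 days
print(f"coverage:  {coverage(9_950, 10_000):.3f}")  # 99.5% of expected records processed
```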
As Simple as Possible, No Simpler
With all of this in mind, design your monitoring system with an eye toward simplicity. In choosing what to monitor, keep the following guidelines in mind:
- The rules that catch real incidents most often should be as simple, predictable, and reliable as possible.
- Data collection, aggregation, and alerting configuration that is rarely exercised (e.g., less than once a quarter for some SRE teams) should be up for removal.
- Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.
To summarize:
- Monitoring is a key component of the reliability of any IT system
- Define the correct SLIs based on the specifics and type of your system
- SLOs are the tool by which you measure your service’s reliability.
I hope you liked this research; I am glad to share this information with you. For more SRE content, please subscribe to our newsletter, follow us on Twitter and LinkedIn, and check out our architecture board if you have not done so yet.
Protect your privacy, be ethical!