Before we start, let's first think about what system reliability means. In simple words, it is the probability of a product performing its intended function under stated conditions, without failure, for a given period of time. Achieving it requires, among other things, continuous monitoring of the state of the system. Why is this so important, and how do we make a system reliable? Today, we will try to answer these two questions by defining ‘golden metrics’ that help track basic service level objectives (SLOs) in a distributed IT system.
Based on the Google SRE workbook, for a distributed system we can define the following golden rules:
- Availability. The amount of time, or the percentage of time, that a system is able to fulfill its intended function.
- Latency. The time it takes to service a request. Frequently, latency is a synonym for delay.
- Traffic. A measure of how much demand is being placed on your system. For a web service, this measurement is usually HTTP requests per second.
- Success/Fail rate. The rate of requests that succeed or fail.
- Saturation. How “full” your service is.
Measuring these characteristics of a system will provide us with all the necessary information about its reliability; the small sketch below shows what an availability target means in practice as an allowed downtime budget.
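Here is a minimal sketch (my own illustration, not from the SRE workbook) of that calculation: it converts an availability SLO, expressed as a percentage of time, into the downtime it allows over a measurement window.

```python
# A minimal sketch: converting an availability SLO (a percentage of time)
# into the downtime budget it allows over a measurement window.

def downtime_budget_hours(slo_percent: float, window_days: float) -> float:
    """Maximum allowed downtime, in hours, for a given SLO and window."""
    return (1 - slo_percent / 100) * window_days * 24

# 99.9% over a year allows roughly 8.76 hours of downtime;
# 99.5% over 30 days allows roughly 3.6 hours.
print(f"{downtime_budget_hours(99.9, 365):.2f} h per year")
print(f"{downtime_budget_hours(99.5, 30):.2f} h per 30 days")
```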
But not all systems are the same. Before we define the ‘golden metrics’, let's describe the types of components against which we will construct our rules.
Three common types of components are:
- Request-driven. The user creates some type of event and expects a response. For example, this could be an HTTP service where the user interacts with a browser or an API for a mobile application.
- Pipeline or background process. A system that takes records as input, mutates them, and places the output somewhere else.
- Storage. A system that accepts data and makes it available to be retrieved later.
Now that we know the basics, it is time to describe the ‘golden metrics’ based on the type of IT system:
| Type | Example | Golden metrics |
| --- | --- | --- |
| Request-driven | User-facing web application, SaaS/PaaS | Availability [% time], Latency [ms], Traffic [req/s], Success rate [%], Saturation [%], Correctness [%], Quality |
| Request-driven | API, Serverless function | Availability [% time], Latency [ms], Traffic [req/s], Success rate [%], Throughput [% of bandwidth] |
| Storage | Database, Blob storage | Availability [% time], Latency [ms], Durability, Saturation [%], Deadlocks [count/min], Read/write rate |
| Request-driven | Mobile application | Availability [% time], Latency [ms], Performance, Responsiveness, Coverage, Volume |
| Pipeline | CI/CD pipeline | Pass rate [%], Duration [sec], Test pass rate [%], Code coverage [%] |
| Pipeline | Backend process | Availability [% time], Latency [ms], Correctness [%], Throughput [% of bandwidth] |
Request-driven metrics
It is very important to measure these metrics in the correct way, so I would like to start by defining the correct service level indicators (SLIs).
Each system is unique from an architectural point of view, and you have to adjust the rules based on it. Below is an example of how to define and implement metrics based on the defined SLOs:
| SLI | Unit | Calculation | Interval | Aggregation | SLO | Measurement |
| --- | --- | --- | --- | --- | --- | --- |
| Availability (success rate) | % req | count(req == ‘2XX’) / total(req) | | Average [7d, 30d] | 99.5% | The proportion of requests that resulted in a successful response: HTTP 2XX / total requests |
| Availability | % time | sum(uptime) / sum(total time) | 5 min | Average [7d, 30d] | 99.5% | How to measure uptime? PaaS metrics or a custom health check |
| Latency | % req | count(req == ‘2XX’ && RPC < SLO bucket 1) / total(req) | 5 min | Average [7d, 30d] | 90% | The proportion of requests faster than some threshold: SLO bucket 1 = 100 ms |
| Latency | % req | count(req == ‘2XX’ && RPC < SLO bucket 2) / total(req) | 5 min | Average [7d, 30d] | 95% | The proportion of requests faster than some threshold: SLO bucket 2 = 300 ms |
| Latency | % req | count(req == ‘2XX’ && RPC < SLO bucket 3) / total(req) | 5 min | Average [7d, 30d] | 99% | The proportion of requests faster than some threshold: SLO bucket 3 = 1 s |
| Traffic | req per 7d/30d | total(req) | 7d, 30d | | | Total requests per week/month |
| Traffic | req/sec | count(req) / period | | Average, Max [1 sec, 1 min, 1 h] | | Define peak capacity periods: start/end of week, end of month |
| Saturation, CPU | % | current usage / max amount | 5 min | Average, Max [7d, 30d] | 95% | CPU usage; many systems degrade in performance before they reach 100% utilization |
| Saturation, RAM | % | current usage / max amount | 5 min | Average, Max [7d, 30d] | 95% | RAM usage; many systems degrade in performance before they reach 100% utilization |
| Fail rate | % req | count(req == ‘5XX’) / total(req) | 5 min | Average [7d, 30d] | 0.5% | HTTP 5XX / total requests |
| Quality | % req | count(req == ‘2XX’) / total(req in undegraded state) | 5 min | Average [7d, 30d] | 0.5% | When the service or its backends are unavailable, measure the proportion of responses served in an undegraded state; degrading gracefully gives a better user experience |
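To illustrate the formulas in the table, here is a minimal sketch that computes the success-rate, fail-rate, and latency-bucket SLIs from a list of raw request records. The record format is an assumption for illustration; the 100 ms / 300 ms / 1 s thresholds follow the SLO buckets above.

```python
# A minimal sketch: computing request-driven SLIs from raw request records.
# Each record is (http_status, latency_ms); the thresholds follow the
# SLO buckets in the table above (100 ms, 300 ms, 1 s).

from typing import Iterable, Tuple

LATENCY_BUCKETS_MS = (100, 300, 1000)

def request_slis(requests: Iterable[Tuple[int, float]]) -> dict:
    reqs = list(requests)
    total = len(reqs)
    if total == 0:
        return {}
    ok = [(status, latency) for status, latency in reqs if 200 <= status < 300]
    slis = {
        "success_rate": len(ok) / total,                           # Availability (success rate)
        "fail_rate": sum(1 for s, _ in reqs if s >= 500) / total,  # Fail rate (5XX)
    }
    for bucket in LATENCY_BUCKETS_MS:                              # Latency SLI per bucket
        slis[f"latency_under_{bucket}ms"] = sum(1 for _, lat in ok if lat < bucket) / total
    return slis

# Example: two fast 2XX responses, one slow 2XX response, one 5XX error.
print(request_slis([(200, 42), (200, 87), (200, 450), (500, 12)]))
```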
Some of these SLIs may overlap: a request-driven service may have a correctness SLI, a pipeline may have an availability SLI, and durability SLIs might be viewed as a variant on correctness SLIs. I recommend choosing a small number (five or fewer) of SLI types that represent the most critical functionality to your customers.
Storage
These kinds of workloads are characterized by durability and availability in most cases; durability is the proportion of records written that can be successfully read.
Different aspects of a system should be measured with different levels of granularity. For example:
- Observing CPU load over the time span of a minute won’t reveal even quite long-lived spikes that drive high tail latencies (see the sampling sketch after this list).
- On the other hand, for a web service targeting no more than 9 hours aggregate downtime per year (99.9% annual uptime), probing for a 200 (success) status more than once or twice a minute is probably unnecessarily frequent.
- Similarly, checking hard drive fullness for a service targeting 99.9% availability more than once every 1–2 minutes is probably unnecessary.
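One way to keep this fine-grained signal without storing every sample, hinted at by the first point, is to collect per-second measurements but aggregate them into histogram buckets before shipping them to the monitoring system. This is a rough sketch with simulated data, not a prescription for any particular tool.

```python
# A rough sketch: sample CPU utilization every second, but keep only a
# histogram of 5%-wide buckets, so short spikes remain visible even though
# far less data is sent to the monitoring system.

import random
from collections import Counter

def bucket(cpu_percent: float, width: int = 5) -> int:
    """Lower bound of the 5%-wide bucket this sample falls into."""
    return min(int(cpu_percent // width) * width, 95)

# Simulated per-second samples for one minute: mostly idle, with a 5-second spike.
samples = [random.uniform(10, 30) for _ in range(55)] + \
          [random.uniform(90, 100) for _ in range(5)]

histogram = Counter(bucket(s) for s in samples)

# The per-minute average hides the spike; the histogram does not.
print("one-minute average:", round(sum(samples) / len(samples), 1))
print("seconds spent at >= 90% CPU:", sum(c for b, c in histogram.items() if b >= 90))
```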
| SLI | Unit | Calculation | Time period | Aggregation | SLO | Measurement |
| --- | --- | --- | --- | --- | --- | --- |
| Availability | % | uptime / total time | | Average [7d, 30d] | 99% | |
| Latency | | RPC < 100 ms | | Average [7d, 30d] | 100 ms | |
| Durability | % | | | | | The proportion of records written that can be successfully read |
| Saturation, CPU | % | current usage / max amount | 5 min, 60 min, 1 day | Average, Max | 95% | |
| Saturation, RAM | % | current usage / max amount | 5 min, 60 min, 1 day | Average, Max | 95% | |
| Saturation, I/O | % DTU | current usage / max amount | 5 min, 60 min, 1 day | Average, Max | 95% | |
| Saturation, free space | % | current usage / max amount | 5 min, 60 min, 1 day | Average, Max | 95% | |
| Throughput | % | Traffic, Kbps / Bandwidth, Kbps | 5 min | | 95% | |
| Deadlocks | number/time | number of deadlocks per hour | 5 min, 1 h | | | Number of deadlocks in a period of time; number of deadlocks per 1000 requests |
| Read | number | | 1 sec | Average, Max | | |
| Write | number | | 1 sec | Average, Max | | |
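The saturation and deadlock rows above are simple ratios; here is a minimal sketch of those calculations. All of the input numbers are illustrative placeholders, not measurements from a real system.

```python
# A minimal sketch of the storage saturation and deadlock SLIs from the table.
# All input values are illustrative placeholders.

def saturation_percent(current_usage: float, max_amount: float) -> float:
    """Saturation: current usage as a percentage of maximum capacity."""
    return 100 * current_usage / max_amount

def deadlock_rates(deadlocks: int, hours: float, total_requests: int) -> dict:
    """Deadlocks per hour and per 1000 requests, as listed in the table."""
    return {
        "per_hour": deadlocks / hours,
        "per_1000_requests": 1000 * deadlocks / total_requests,
    }

print(f"CPU saturation:  {saturation_percent(3.1, 4.0):.1f} %")   # 3.1 of 4 vCPUs used
print(f"Disk saturation: {saturation_percent(410, 512):.1f} %")   # 410 of 512 GB used
print(deadlock_rates(deadlocks=6, hours=24, total_requests=1_200_000))
```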
Pipeline
This might be a simple process that runs on a single instance in real time, or a batch process that takes many hours. Examples include:
- A system that periodically reads data from a relational database and writes it into a distributed hash table for optimized serving
- A video processing service that converts video from one format to another
- A system that reads in log files from many sources to generate reports
- A monitoring system that pulls metrics from remote servers and generates time series and alerts
A good example of a pipeline process is a CI/CD pipeline.
| SLI | Unit | Calculation | Time period | Aggregation | Measurement |
| --- | --- | --- | --- | --- | --- |
| Pass rate | % | number of successful runs / total number of runs | 30d | | The CI/CD tool commonly provides such metrics out of the box |
| Duration | time | time to run | 30d | | The CI/CD tool commonly provides such metrics out of the box |
| Test pass rate | % | number of tests passed / total number of tests | 30d | | The CI/CD tool commonly provides such metrics out of the box |
| Code coverage | % | code coverage | 30d | | The CI/CD tool commonly provides such metrics out of the box |
| Freshness | | count(records updated in last 10 days) / total(records) | | | The proportion of the data that was updated more recently than some time threshold. Ideally this metric counts how many times a user accessed the data, so that it most accurately reflects the user experience |
| Correctness | | Inject data with known outputs into the system and count the proportion of times the output matches expectations | | | The proportion of records coming into the pipeline that resulted in the correct value coming out |
| Correctness | | Use a method of calculating the correct output that is distinct from the pipeline itself, applied to known-good input | | | The proportion of records coming into the pipeline that resulted in the correct value coming out |
| Coverage (general) | | Export the number of records the pipeline should have processed and the number it successfully processed; this metric may miss records the pipeline did not know about due to misconfiguration | | | For batch processing, the proportion of jobs that processed above some target amount of data. For streaming processing, the proportion of incoming records that were successfully processed within some time window |
| Throughput | % | Traffic, Kbps / Bandwidth, Kbps | 1 min | Average, Max | |
| Volume | | number of records processed | 1 min | Average, Max | |
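As a final illustration, here is a minimal sketch of the freshness and coverage SLIs. It assumes each record carries a last-updated timestamp and that the pipeline exports counters for the records it should have processed and the records it actually processed; both assumptions are for illustration only.

```python
# A minimal sketch of the pipeline freshness and coverage SLIs from the table.
# Timestamps and counter values are illustrative assumptions.

from datetime import datetime, timedelta, timezone

def freshness(last_updated: list, threshold: timedelta = timedelta(days=10)) -> float:
    """Proportion of records updated more recently than the threshold."""
    now = datetime.now(timezone.utc)
    return sum(1 for ts in last_updated if now - ts < threshold) / len(last_updated)

def coverage(records_processed: int, records_expected: int) -> float:
    """Proportion of expected records the pipeline successfully processed.
    Note: this misses records the pipeline never knew about (misconfiguration)."""
    return records_processed / records_expected

now = datetime.now(timezone.utc)
timestamps = [now - timedelta(days=d) for d in (1, 3, 12, 30)]
print(f"freshness: {freshness(timestamps):.2f}")    # 2 of 4 records updated within 10 days
print(f"coverage:  {coverage(9_950, 10_000):.3f}")  # 99.5% of expected records processed
```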
As Simple as Possible, No Simpler
With all of this in mind, design your monitoring system with an eye toward simplicity. In choosing what to monitor, keep the following guidelines in mind:
- The rules that catch real incidents most often should be as simple, predictable, and reliable as possible.
- Data collection, aggregation, and alerting configuration that is rarely exercised (e.g., less than once a quarter for some SRE teams) should be up for removal.
- Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.
To summarize:
- Monitoring is a key component of the reliability of any IT system
- Define the correct SLIs based on the specifics and type of your system
- SLOs are the tool by which you measure your service’s reliability.
I hope you liked this research; I am glad to share this information with you. For more SRE content, please subscribe to our newsletter, follow us on Twitter and LinkedIn, and check out our architecture board if you have not done so yet.
Protect your privacy, be ethical!