Why do we need to establish good product observability? How does monitoring impact product reliability? We will explore these questions, along with the significance of the four golden signals in measuring a system’s performance and reliability.
If you’ve ever worked with on-premises environments, you know that you can physically touch the servers. If an application becomes unresponsive, someone can physically determine why that happened. In the cloud, though, the servers aren’t yours—they’re the provider’s—and you can’t physically inspect them. So the question becomes: how do you know what’s happening with your server, database, or application?
The answer is by using observability tools:
- Visibility: insight into the system allows users to understand what is happening with their applications and infrastructure. It answers questions such as “Are my systems functioning?” or “Do my systems have sufficient resources available?”
- Error reporting and alerting: Users want to monitor their service at a glance through healthy/unhealthy status icons or red/green indicators. Customers appreciate any proactive alerting, anomaly detection, or guidance on issues. Ideally, they want to avoid connecting the dots themselves.
- Efficient troubleshooting: Users don’t want multiple tabs open. They need a system that can proactively correlate relevant signals and make it easy to search across different data sources, like logs and metrics. If possible, the service needs to be opinionated about the potential cause of the issue and recommend a meaningful direction for the customer to start their investigation.
- Performance improvement: Users need a service that can perform retrospective analysis—help them plan intelligently by analyzing trends and understand how changes in the system affect its performance.
Monitoring
Monitoring is the foundation of product reliability. It reveals what needs urgent attention and shows trends in application usage patterns, which can yield better capacity planning and, more generally, improve an application client’s experience and lessen their pain.
In Google’s Site Reliability Engineering book, which is available to read at landing.google.com/sre/books, monitoring is defined as: “Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.” An application client normally only sees the public side of a product, and as a result, developers and business stakeholders both tend to think that the most crucial way to make the client happy is by spending the most time and effort on developing that part of the product.
However, to be truly reliable, even the very best products still must be deployed into environments with enough capacity to handle the anticipated client load. Great products also need thorough testing, preferably automated testing, and a refined continuous integration/continuous delivery (CI/CD) release pipeline.
Postmortems and root cause analyses are the DevOps team’s way of letting the client know why an incident happened and why it is unlikely to happen again. In this context we are discussing a system or software failure, but the term “incident” can also be used to describe a breach of security.
Transparency here is key to building trust:
- We need our products to improve continually, and we need monitoring data to ensure that happens.
- We need dashboards to provide business intelligence so our DevOps personnel have the data they need to do their jobs.
- We need automated alerts because humans tend to look at things only when there’s something important to look at.
An even better option is to construct automated systems to handle as many alerts as possible so humans only have to look at the most critical issues.
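To make that idea concrete, here is a minimal sketch of alert routing in Python. The `Alert` class, the severity levels, and the handler functions are hypothetical and not tied to any particular monitoring product: only critical alerts reach a human, while everything else goes to an automated handler.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str  # "info", "warning", or "critical" (hypothetical levels)
    message: str

def auto_remediate(alert: Alert) -> None:
    # Placeholder for automated handling, e.g. restarting a process or scaling out.
    print(f"[auto] handling {alert.name}: {alert.message}")

def page_human(alert: Alert) -> None:
    # Placeholder for an on-call notification (pager, chat, email, ...).
    print(f"[page] notifying the on-call for {alert.name}: {alert.message}")

def route_alert(alert: Alert) -> None:
    """Send only the most critical issues to a human; automate the rest."""
    if alert.severity == "critical":
        page_human(alert)
    else:
        auto_remediate(alert)

route_alert(Alert("disk_usage", "warning", "disk 80% full"))
route_alert(Alert("service_down", "critical", "checkout service unreachable"))
```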
Typically, there’s some triggering event: a system outage, data loss, a monitoring failure, or some form of manual intervention. The trigger leads to a response by both automated systems and DevOps personnel. Many times the response starts by examining signal data that comes in through monitoring. The impact of the issue is evaluated and escalated when needed, and an initial response is formulated. Throughout, good SREs will strive to keep the customer informed and respond when appropriate.
Finally, we need monitoring tools that help provide data crucial to debugging application functional and performance issues.
“Four golden signals”
There are “four golden signals” that measure a system’s performance and reliability. They are latency, traffic, saturation, and errors.
Latency measures how long it takes a particular part of a system to return a result. Latency is important because:
- It directly affects the user experience.
- Changes in latency could indicate emerging issues.
- Its values may be tied to capacity demands.
- It can be used to measure system improvements.
But how is it measured? Sample latency metrics include:
- Page load latency
- Number of requests waiting for a thread
- Query duration
- Service response time
- Transaction duration
- Time to first response
- Time to complete data return
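As a rough sketch of how one of these metrics might be computed, the snippet below summarizes a list of made-up response times as percentiles; the nearest-rank method and the sample values are illustrative assumptions, not a prescribed implementation.

```python
# Summarize per-request latencies as percentiles (a common way to report latency).
def percentile(samples: list[float], pct: float) -> float:
    """Return the value below which `pct` percent of the sorted samples fall."""
    ordered = sorted(samples)
    # Nearest-rank method: index of the sample covering the requested percentile.
    index = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[index]

# Hypothetical response times in milliseconds collected by a monitoring agent.
response_times_ms = [42, 51, 38, 47, 95, 41, 120, 44, 39, 48]

print("median latency:", percentile(response_times_ms, 50), "ms")
print("95th percentile latency:", percentile(response_times_ms, 95), "ms")
```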
The next signal is traffic, which measures how many requests are reaching your system. Traffic is important because:
- It’s an indicator of current system demand.
- Its historical trends are used for capacity planning.
- It’s a core measure when calculating infrastructure spend.
Sample traffic metrics include:
- # HTTP requests per second
- # requests for static vs. dynamic content
- Network I/O
- # concurrent sessions
- # transactions per second
- # retrievals per second
- # active requests
- # write ops
- # read ops
- # active connections
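Here is a minimal sketch of one traffic metric, requests per second over a sliding window. The `RequestRateCounter` class and the synthetic timestamps are hypothetical, intended only to show the idea.

```python
import time
from collections import deque

class RequestRateCounter:
    """Track how many requests arrived in the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float = 1.0):
        self.window_seconds = window_seconds
        self.timestamps: deque[float] = deque()

    def record_request(self, now: float | None = None) -> None:
        self.timestamps.append(time.monotonic() if now is None else now)

    def requests_per_second(self, now: float | None = None) -> float:
        now = time.monotonic() if now is None else now
        # Drop timestamps that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window_seconds

# Simulate a small burst of traffic with synthetic timestamps.
counter = RequestRateCounter(window_seconds=1.0)
for t in [0.1, 0.2, 0.25, 0.4, 0.9]:
    counter.record_request(now=t)
print("current traffic:", counter.requests_per_second(now=1.0), "req/s")
```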
The third signal is saturation, which measures how close to capacity a system is. It’s important to note, though, that capacity is often a subjective measure that depends on the underlying service or application. Saturation is important because:
- It’s an indicator of current system demand—in other words, how full the service is.
- It focuses on the most constrained resources.
- It’s frequently tied to degrading performance as capacity is reached.
Sample capacity metrics include:
- % memory utilization
- % thread pool utilization
- % cache utilization
- % disk utilization
- % CPU utilization
- Disk quota
- Memory quota
- # of available connections
- # of users on the system
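A small sketch of a saturation check might look like the following; the usage and quota numbers are invented rather than read from a real monitoring API.

```python
# Saturation compares current usage against the capacity of the most constrained resource.
def utilization(used: float, capacity: float) -> float:
    """Return usage as a fraction of capacity (0.0 means idle, 1.0 means full)."""
    return used / capacity

# Hypothetical snapshot of resource usage versus provisioned capacity.
resources = {
    "memory_gb": (27.0, 32.0),
    "cpu_cores": (6.2, 8.0),
    "disk_gb": (410.0, 500.0),
    "db_connections": (92.0, 100.0),
}

for name, (used, capacity) in resources.items():
    pct = utilization(used, capacity) * 100
    flag = "  <-- approaching saturation" if pct >= 90 else ""
    print(f"{name}: {pct:.1f}% of capacity{flag}")

# The signal to watch is the most constrained resource, not the average.
most_constrained = max(resources, key=lambda r: utilization(*resources[r]))
print("most constrained resource:", most_constrained)
```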
The fourth signal is errors, which are events that measure system failures or other issues. Errors are often raised when a flaw, failure, or fault in a computer program or system causes it to produce incorrect or unexpected results, or to behave in unintended ways. Errors might indicate:
- Configuration or capacity issues
- Service level objective violations
- That it’s time to emit an alert
Sample error metrics include:
- Wrong answers or incorrect content
- # 400/500 HTTP codes
- # failed requests
- # exceptions
- # stack traces
- Servers that fail liveness checks
- # dropped connections
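As a brief illustration, an error-rate metric can be derived directly from response status codes; the codes below are synthetic.

```python
# Count HTTP error responses and compute an error rate from a batch of status codes.
def error_rate(status_codes: list[int]) -> float:
    """Fraction of responses that were client (4xx) or server (5xx) errors."""
    errors = sum(1 for code in status_codes if code >= 400)
    return errors / len(status_codes) if status_codes else 0.0

# Synthetic sample of recent response codes.
recent_responses = [200, 200, 500, 200, 404, 200, 200, 503, 200, 200]

print(f"error rate: {error_rate(recent_responses):.1%}")  # -> error rate: 30.0%
```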
Now, let’s return to the observability concept.
SLI, SLO and SLA
The three terms that are frequently used in this course are SLI, SLO, and SLA. Earlier, we discussed what MTTR is and why it is so important; these three terms matter just as much.
Service level indicators, or SLIs, are carefully selected monitoring metrics that measure one aspect of a service’s reliability. Ideally, SLIs should have a close linear relationship with your users’ experience of that reliability, and we recommend expressing them as the ratio of two numbers: the number of good events divided by the count of all valid events.
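Following that recommendation, an SLI reduces to a simple ratio. The request counts in this sketch are hypothetical.

```python
# An SLI expressed as the ratio of good events to all valid events.
def availability_sli(good_requests: int, valid_requests: int) -> float:
    """Fraction of valid requests that were served successfully."""
    return good_requests / valid_requests if valid_requests else 1.0

# Hypothetical counts over a measurement window.
good = 999_412      # requests answered successfully and fast enough
valid = 1_000_000   # all requests that should have been served

print(f"availability SLI: {availability_sli(good, valid):.4%}")  # -> 99.9412%
```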
A service level objective, or SLO, combines a service level indicator with a target reliability. If you express your SLIs as is commonly recommended, your SLOs will generally be somewhere just short of 100%, for example, 99.9%, or “three nines.” You can’t measure everything, so when possible, you should choose SLOs that are S.M.A.R.T. SLOs should be specific: “Hey everyone, is the site fast enough for you?” is not specific; it’s subjective. “The 95th percentile of results is returned in under 100 ms” is specific. They also need to be based on indicators that are measurable. A lot of monitoring is numbers, grouped over time, with math applied. An SLI must be a number or a delta—something we can measure and place in a mathematical equation. SLO goals should be achievable: “100% availability” might sound good, but it’s not possible to obtain, let alone maintain, over an extended window of time.
SLOs should be relevant: does the metric matter to the user, and will it help achieve application-related goals? If not, then it’s a poor metric. And SLOs should be time-bound: if you want a service to be 99% available, is that per year, per month, or per day? Does the calculation look at specific windows of set time, from Sunday to Sunday for example, or is it a rolling period of the last seven days? If we don’t know the answers to those types of questions, the objective can’t be measured accurately.
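To make the time-bound point concrete, here is a minimal sketch comparing a fixed calendar week with a rolling seven-day window; the daily request counts are made up.

```python
# Contrast a fixed calendar window with a rolling window when evaluating an SLO.
# Each entry is (good_requests, total_requests) for one day; the numbers are invented.
daily_counts = [
    (99_900, 100_000), (99_950, 100_000), (97_000, 100_000),  # a bad day
    (99_980, 100_000), (99_990, 100_000), (99_970, 100_000),
    (99_960, 100_000), (99_990, 100_000), (99_995, 100_000),
]

def availability(days: list[tuple[int, int]]) -> float:
    good = sum(g for g, _ in days)
    total = sum(t for _, t in days)
    return good / total

# Calendar window: the first seven days (say, Sunday through Saturday).
calendar_week = daily_counts[:7]
# Rolling window: always the most recent seven days.
rolling_week = daily_counts[-7:]

print(f"calendar-week availability: {availability(calendar_week):.3%}")
print(f"rolling 7-day availability: {availability(rolling_week):.3%}")
```

Both windows include the same bad day, but the number you report, and when an alert fires, depends on which window the SLO specifies.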
And then there are Service Level Agreements, or SLAs, which are commitments made to your customers that your systems and applications will have only a certain amount of “down time.” An SLA describes the minimum levels of service that you promise to provide to your customers and what happens when you break that promise. If your service has paying customers, an SLA may include some way of compensating them with refunds or credits when that service has an outage that is longer than this agreement allows.
For SLOs, SLIs and SLAs to help improve service reliability, all parts of the business must agree that they are an accurate measure of user experience and must also agree to use them as a primary driver for decision making. Being out of SLO must have concrete, well-documented consequences, just as there are consequences for breaching SLAs.
For example, slowing down the rate of change and directing more engineering effort towards eliminating risks and improving reliability are actions that could be taken to get your product back to meeting its SLOs faster. Operations teams need strong executive support to enforce these consequences and effect change in your development practice. Here is an example of an SLA commitment: maintain an error rate of less than 0.3% for the billing system. The error rate is the quantifiable measure, which is the SLI, and less than 0.3% is the specific target set, which is the SLO in this case.
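A rough sketch of how that billing target could be checked, with invented request counts, might look like this:

```python
# Check the billing error-rate SLO (< 0.3%) and how much error budget remains.
SLO_ERROR_RATE = 0.003  # the target: fewer than 0.3% of requests may fail

def slo_report(failed: int, total: int) -> None:
    error_rate = failed / total
    budget = SLO_ERROR_RATE * total          # failures the SLO allows in this window
    remaining = budget - failed              # how many failures we can still "spend"
    print(f"error rate: {error_rate:.3%} (target < {SLO_ERROR_RATE:.1%})")
    print(f"error budget remaining: {remaining:.0f} of {budget:.0f} requests")
    if error_rate >= SLO_ERROR_RATE:
        print("SLO breached: slow down releases and focus on reliability work")

# Hypothetical month of billing traffic.
slo_report(failed=2_150, total=1_000_000)
```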
The problem with SLAs is that you’re only incentivized to promise the minimum level of service and compensation that will stop your customers from replacing you with a competitor. Because customers often feel the impact of reliability problems before those promises are breached, reliability that falls far short of the levels of service that keep your customers happy contributes to a perception that your service is unreliable.
You should answer questions like:
- Compensating your customers all the time can get expensive, so what targets do you hold yourself to internally?
- When does your monitoring system trigger an operational response?
To give you the breathing room to detect problems and take remedial action before your reputation is damaged, your alerting thresholds are often substantially higher than the minimum levels of service documented in your SLA.
Observability starts with signals, which are metric, logging, and trace data captured and integrated into products from the hardware layer up. Services can be monitored for compliance with service level objectives (SLOs), and error budgets can be tracked. Health checks can be used to check uptime and latency for external-facing sites and services.
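As a closing sketch, here is a very simple external health check using only the Python standard library; the URL is a placeholder for your own service’s health endpoint.

```python
import time
import urllib.request

def health_check(url: str, timeout: float = 5.0) -> dict:
    """Probe an endpoint once and report whether it is up and how long it took."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            healthy = 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        healthy = False
    latency_ms = (time.monotonic() - start) * 1000
    return {"url": url, "healthy": healthy, "latency_ms": round(latency_ms, 1)}

# Placeholder endpoint; point this at your own service's health route.
print(health_check("https://example.com/healthz"))
```

A real probe would run on a schedule from multiple regions and feed its results into your alerting pipeline.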
I hope you liked the post. For more SRE content, please subscribe to our newsletter and follow us on Twitter and LinkedIn.
Protect your privacy, and be ethical!