With a set of requirements in place, we can now consider how to measure whether the technical and business requirements have been met. To manage a service well, it is important to understand which behaviors matter, and how to measure and evaluate these behaviors. These must always be considered in the context of the constraints, which are usually time, funding and people. A common way to measure success is to use key performance indicators (KPIs).
KPIs can be categorized as business KPIs and technical KPIs:
- Business KPIs are a formal way of measuring what the business values, such as return on investment (ROI), in relation to a project or service. Other examples include earnings before interest and taxes, measures of impact on users such as customer churn, and employee turnover.
- Technical or software KPIs can consider aspects such as how effective the software is through page views, user registration and number of checkouts. These KPIs should also be closely aligned with business objectives.
As an architect, it is important that you understand how the business measures the success of the systems that you design. A KPI is not the same thing as a goal or objective. The goal is the outcome or result you want to achieve. The KPI is a metric that indicates whether you are on track to achieve the goal. To be most effective, KPIs need an accompanying goal, and the goal should be the starting point in defining KPIs. Then, for each goal, define the KPIs that will allow you to monitor and measure progress. For each KPI, define targets for what success looks like. Monitoring KPIs against goals is important to achieving success and allows readjustment based on feedback. As an example, a goal may be to increase turnover for an online store, and an associated KPI may be the percentage of conversions on the website.
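The relationship between a goal, a KPI and its target can be sketched in a few lines of Python. This is only an illustration: the `Kpi` class, the `conversion_rate` helper, the 3% target and the visit and order counts are all assumptions made up for the example, not figures from a real store.

```python
from dataclasses import dataclass


@dataclass
class Kpi:
    """A KPI tied to a goal, with a target that defines success."""
    goal: str
    name: str
    target: float  # hypothetical target, expressed as a fraction

    def on_track(self, measured: float) -> bool:
        # The KPI is on track when the measured value meets or exceeds the target.
        return measured >= self.target


def conversion_rate(orders: int, visits: int) -> float:
    """Fraction of visits that ended in an order."""
    return orders / visits if visits else 0.0


# Illustrative numbers only.
kpi = Kpi(goal="Increase turnover of the online store",
          name="Website conversion rate",
          target=0.03)
rate = conversion_rate(orders=45, visits=1200)
print(f"{kpi.name}: {rate:.2%}, on track: {kpi.on_track(rate)}")
```

Monitoring then amounts to recomputing `rate` on a regular schedule and comparing it with the target.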
It is important to evaluate what kind of system it is. The type of system being evaluated determines the data that can be measured.
For example, for user-facing systems: Was a request responded to? That is availability. How long did it take to respond? That is latency. How many requests can be handled? That is throughput.
For data storage systems, how long does it take to read and write data? That’s latency. Is the data there when we need it? That’s availability. If there is a failure, do we lose any data? That’s durability.
The key to all of these items is that the questions can be answered with data gathered from the services. Business decision makers want to measure the value of projects. This enables them to better support the most valuable projects and not waste resources on those that are not beneficial.
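As a rough sketch of how those questions can be answered from service data, the snippet below derives availability, latency and throughput from a list of request records. The `Request` record, the helper names and the sample values are assumptions for illustration. The latency here is a plain mean to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool           # True if the request was answered successfully
    latency_ms: float  # time taken to respond


def availability(requests: list[Request]) -> float:
    """Fraction of requests that received a successful response."""
    return sum(r.ok for r in requests) / len(requests)


def mean_latency_ms(requests: list[Request]) -> float:
    """Average response time of the successful requests."""
    good = [r.latency_ms for r in requests if r.ok]
    return sum(good) / len(good)


def throughput_rps(requests: list[Request], window_seconds: float) -> float:
    """Requests handled per second over the observation window."""
    return len(requests) / window_seconds


# Made-up sample: five requests observed over a 2-second window.
sample = [Request(True, 80), Request(True, 120), Request(False, 0),
          Request(True, 100), Request(True, 100)]
print(availability(sample))         # 0.8
print(mean_latency_ms(sample))      # 100.0
print(throughput_rps(sample, 2.0))  # 2.5
```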
Let KPIs be S.M.A.R.T.
For KPIs to be effective, they must be specific rather than general. For example, user friendly is not specific. It’s very subjective. Measurable is vital because monitoring the KPIs indicates whether you’re moving toward or away from your goal. Being achievable is also important. For example, expecting 100 percent conversions on a website is not achievable. Relevant is absolutely vital. Without a relevant KPI, the goal probably will not be met. In our example of increasing turnover, if we’re improving the conversion rate, a subsequent increase in turnover should be achievable assuming a similar number of users. Time-bound helps with measuring the KPI.
Define the right SLI, SLO, and SLA
Let’s introduce service level terminology. To provide a given level of service to customers, it is important to define service level indicators, or SLIs, objectives, or SLOs, and agreements, or SLAs. These are measurements that describe basic properties of the metrics to measure, the values those metrics should read and how to react if the metrics cannot be met.
A service level indicator (SLI) is a quantitative measure of some aspect of the level of service being provided. Examples include throughput, latency and error rate. A service level objective (SLO) is an agreed-upon target or range of values for a service level that is measured by an SLI. It is normally stated in the form SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.
An example of an SLO is that the average latency of HTTP requests for our service should be less than 100 milliseconds.
A service level agreement (SLA) is an agreement between a service provider and a consumer. It defines the responsibilities for delivering a service, and the consequences when these responsibilities are not met. An SLA is more binding than an SLO because it carries consequences, so its threshold is normally set looser than the SLO.
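The two forms of an SLO can be expressed as a simple bound check on an SLI value. The following is a minimal sketch: the `Slo` class is hypothetical, the bounds are treated as inclusive, and the measured values and the throughput range are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slo:
    """An SLO as a bound on an SLI: lower <= SLI <= upper (either side optional)."""
    description: str
    upper: Optional[float] = None
    lower: Optional[float] = None

    def is_met(self, sli_value: float) -> bool:
        # Bounds are inclusive in this sketch.
        if self.lower is not None and sli_value < self.lower:
            return False
        if self.upper is not None and sli_value > self.upper:
            return False
        return True


# Form 1: SLI <= target, using the 100 ms average latency example.
latency_slo = Slo("Average HTTP latency (ms)", upper=100.0)
print(latency_slo.is_met(85.0))   # True
print(latency_slo.is_met(130.0))  # False

# Form 2: lower bound <= SLI <= upper bound (hypothetical throughput range).
throughput_slo = Slo("Requests per second", lower=200.0, upper=800.0)
print(throughput_slo.is_met(500.0))  # True
```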
We want to architect a solution and maintain an agreed SLO so that we provide ourselves spare capacity against the SLA. Understanding what users want from a service will help inform the selection of indicators. The indicators must be measurable. For example, "fast response time" is not measurable, whereas "HTTP GET requests that respond within 400 milliseconds, aggregated per minute" is clearly measurable. Similarly, "highly available" is not measurable, but "percentage of successful requests over all requests, aggregated per minute" is measurable. Not only must indicators be measurable, but the way they are aggregated needs careful consideration. For example, consider requests per second to a service.
How is the value calculated? By measurements obtained once per second, or by averaging requests over a minute? Averaging over a minute may hide high request rates that occur in bursts of a few seconds. For example, consider a service that receives 1,000 requests per second on even-numbered seconds and zero requests on odd-numbered seconds. The average requests per second over a minute would be reported as 500. However, in reality the load is at times twice as large as the average. Similarly, averages can mask user experience when used for metrics like latency, because they hide the requests that take much longer to respond than the average. It is better to use percentiles for such metrics: a high-order percentile such as the 99th shows worst-case values, while the 50th percentile indicates the typical case.
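The effect of aggregation can be checked with a short Python sketch. The request-rate series reproduces the even and odd seconds example. The latency sample and the nearest-rank `percentile` helper are assumptions chosen for illustration, and other percentile definitions would give slightly different values.

```python
import math
from statistics import mean


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 1,000 requests on even-numbered seconds, 0 on odd-numbered seconds, for one minute.
per_second = [1000 if s % 2 == 0 else 0 for s in range(60)]
print(mean(per_second))  # 500
print(max(per_second))   # 1000

# Made-up latencies in ms: 98 fast responses and 2 very slow ones.
latencies = [50.0] * 98 + [2000.0] * 2
print(mean(latencies))            # 89.0
print(percentile(latencies, 50))  # 50.0
print(percentile(latencies, 99))  # 2000.0
```

The mean latency of 89 ms describes neither the typical request (50 ms) nor the slow ones (2,000 ms), which is why the two percentiles are more informative here.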
You want objectives that help or improve the user experience. It is easy to define SLOs based on what is easy to measure rather than what is useful. For clarity, SLOs should specify how they are measured and the conditions under which they are valid.
For example, availability could be specified as measured with an uptime check over ten seconds, aggregated per minute. It is unrealistic as well as undesirable to have SLOs with a 100% target. Such a target results in expensive, overly conservative solutions that are still unlikely to reach the SLO. It is better to track the rate at which SLOs are missed and work to improve this. In many cases 99% may be good enough availability and be far easier to engineer and achieve. It is also highly likely to be much more cost-effective to run. The use case also needs to be considered.
For example, if an HTTP service for photo uploads requires 99% of uploads to be complete within 100 milliseconds aggregated per minute, this may be unrealistic or overkill if the majority of users are using mobile phones. In such a case, an SLO of 80% is much more achievable and good enough. It is often okay to specify multiple SLOs. Consider the following: 99% of HTTP GET calls will complete in less than 100 milliseconds. This is a valid SLO, but it may be the case that the shape of the performance curve is important. In this case, the SLO could be written as follows: 90% of HTTP GET calls will complete in less than 50 milliseconds, 99% of HTTP GET calls will complete in less than 100 milliseconds, and 99.9% of HTTP GET calls will complete in less than 500 milliseconds.
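A multi-part SLO like the one above can be evaluated by checking, for each tier, what fraction of calls finished under that tier's threshold. The sketch below assumes the latencies for the evaluation window are already collected in a list. The function names and the sample data are hypothetical.

```python
def fraction_within(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of calls that completed in less than threshold_ms."""
    return sum(l < threshold_ms for l in latencies_ms) / len(latencies_ms)


# The three-part SLO from the text: (required fraction, latency threshold in ms).
SLO_TIERS = [(0.90, 50.0), (0.99, 100.0), (0.999, 500.0)]


def evaluate(latencies_ms: list[float]) -> list[bool]:
    """Return, for each tier, whether enough calls met its threshold."""
    return [fraction_within(latencies_ms, limit) >= required
            for required, limit in SLO_TIERS]


# Made-up sample of 1,000 HTTP GET latencies in milliseconds.
healthy = [20.0] * 920 + [80.0] * 75 + [300.0] * 4 + [900.0] * 1
print(evaluate(healthy))    # [True, True, True]

# Same typical latency, but a heavier tail of very slow calls.
long_tail = [20.0] * 950 + [80.0] * 45 + [900.0] * 5
print(evaluate(long_tail))  # [True, True, False]
```

The second sample passes the 90% and 99% tiers and fails only the 99.9% tier, which is the kind of change in the shape of the curve that a single-threshold SLO would not reveal.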
Avoid common mistakes
Selecting SLOs has both product and business implications. Often trade-offs need to be made based on constraints such as staff, time to market and funding. The aim is to keep users happy, not to have an SLO that requires heroic efforts to maintain.
Let me give you some tips on selecting SLOs:
- Do not make them too high. It is better to start with lower SLOs and tighten them over time as you learn about the system than to define SLOs that are unattainable and require significant effort and cost to try to achieve.
- Keep them simple. More complex SLIs can obscure important changes in performance.
- Avoid absolute values. An SLO that states 100% availability is unrealistic. Such an SLO increases the time to build, the complexity, and the cost to operate, and in most cases is highly unlikely to be required.
- Minimize SLOs. A common mistake is to have too many SLOs. The recommendation is to have just enough SLOs to give coverage of the key system attributes.
In summary, good SLOs should reflect what the users care about. They work as a forcing function for development teams. A poor SLO will result in a significant amount of wasted work if it is too ambitious, or a poor product if it is too relaxed.
An SLA is a business contract between the service provider and the customer. Not every service has an SLA, but all services should have SLOs. A penalty will apply if the service provider does not maintain the levels agreed on. As with SLOs, it is better to be conservative with SLAs, because it is difficult to change or remove SLAs that offer little value or cause a large amount of work. In addition, because they can have a financial implication through compensation to the customer, setting them too high can result in unnecessary compensation being paid.
To provide protection and some level of safety, an SLA should have a threshold that is less strict than the SLO. This should always be the case.
Let’s consider an example of an SLI, an SLO and an SLA for a service:
- The service is an HTTP endpoint accessed using HTTP GET. The SLI is the end-to-end latency of successful HTTP responses, that is, responses with status 200. These are aggregated over one minute.
- The agreed SLO is that the latency of 99% of the responses must be less than or equal to 200 milliseconds.
- The SLA is set so that the user is compensated if the 99th percentile latency exceeds 300 milliseconds. The SLA has clearly built a buffer over the SLO, which means that even if the SLO is missed there is some capacity before the SLA is broken. This is the desired relationship between SLO and SLA.
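The example above can be sketched as a small check that runs once per minute. The 200 ms and 300 ms thresholds come from the example. The function names, the nearest-rank percentile method and the sample latencies are assumptions for illustration.

```python
import math

SLO_MS = 200.0  # objective from the example: 99% of responses within 200 ms
SLA_MS = 300.0  # agreement from the example: compensation above 300 ms


def p99(latencies_ms: list[float]) -> float:
    """99th percentile (nearest rank) of one minute of successful-response latencies."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(99 * len(ordered) / 100))
    return ordered[rank - 1]


def classify(latencies_ms: list[float]) -> str:
    """Compare the minute's SLI with the SLO and SLA thresholds."""
    sli = p99(latencies_ms)
    if sli <= SLO_MS:
        return "SLO met"
    if sli <= SLA_MS:
        return "SLO missed, SLA still met"
    return "SLA breached"


# Made-up minutes of HTTP 200 latencies in milliseconds.
print(classify([120.0] * 99 + [150.0]))     # SLO met
print(classify([120.0] * 98 + [250.0] * 2)) # SLO missed, SLA still met
print(classify([120.0] * 98 + [400.0] * 2)) # SLA breached
```

The middle case is the buffer in action: the objective is missed, which should prompt investigation, but no compensation is owed yet.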