We have already written a bit about designing reliable systems. Today, we'll go over how to design services to meet requirements for availability, durability, and scalability. We will also discuss how to build fault-tolerant systems by avoiding single points of failure, correlated failures, and cascading failures, and how to avoid overload failures by using design patterns such as the circuit breaker and truncated exponential backoff.
When designing for reliability, consider these key performance metrics:
| Availability | Durability | Scalability |
| --- | --- | --- |
| The percent of time a system is running and able to process requests | The odds of losing data because of a hardware or system failure | The ability of a system to continue to work as user load and data grow |
| Achieved with fault tolerance | Achieved by replicating data in multiple zones | Monitor usage |
| Create backup systems | Do regular backups | Use capacity autoscaling to add and remove servers in response to changes in load |
| Use health checks | Practice restoring from backups | |
| Use clear box metrics to count real traffic success and failure | | |
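To make the "clear box metrics" idea concrete, availability can be computed directly from real traffic counts. The sketch below is a minimal Python illustration with hypothetical request numbers:

```python
# A minimal sketch of a "clear box" availability metric: availability is the
# fraction of real requests that succeeded. The counters here are hypothetical;
# in practice they come from your monitoring system (load balancer logs,
# application metrics, and so on).

def availability(successful_requests: int, total_requests: int) -> float:
    """Availability as the percentage of requests served successfully."""
    if total_requests == 0:
        return 100.0  # no traffic, nothing failed
    return 100.0 * successful_requests / total_requests

# Example: 9,994,000 successes out of 10,000,000 requests is roughly 99.94%.
print(f"{availability(9_994_000, 10_000_000):.2f}%")
```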
Now, let’s look at the most common problems that arise when building distributed systems and ways to resolve them.
Single points of failure
Avoid single points of failure by replicating data and creating multiple virtual machine instances. It is important to define your unit of deployment and understand its capabilities. To avoid single points of failure, you should deploy two extra instances, or N + 2, to handle both failure and upgrades. These deployments should ideally be in different zones to mitigate zonal failures.
Consideration
- Define your unit of deployment
- N+2: Plan to have one unit out for upgrade or testing and survive another failing
- Make sure that each unit can handle the extra load
- Don’t make any single unit too large
- Try to make units interchangeable stateless clones
Example
Consider 3 VMs that are load balanced to achieve N+2. If one is being upgraded and another fails, 50% of the available capacity of the compute is removed, which potentially doubles the load on the remaining instance and increases the chances of that failing. This is where capacity planning and knowing the capability of your deployment unit is important. Also, for ease of scaling, it is a good practice to make the deployment units interchangeable stateless clones.
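A minimal capacity-planning sketch, using hypothetical load and capacity figures, shows how to size a deployment for N+2 and check the load on the remaining units when two are out:

```python
# A minimal N+2 capacity-planning sketch. The peak load and per-unit capacity
# figures are hypothetical; substitute measured values for your own deployment unit.
import math

def units_needed(peak_qps: float, unit_capacity_qps: float, spare: int = 2) -> int:
    """Units required to serve peak load with N+2 (one upgrading, one failed)."""
    return math.ceil(peak_qps / unit_capacity_qps) + spare

def per_unit_load(peak_qps: float, total_units: int, units_out: int) -> float:
    """Load on each remaining unit when some units are out of service."""
    return peak_qps / (total_units - units_out)

peak, capacity = 1200, 1000          # hypothetical: 1200 QPS peak, 1000 QPS per VM
n = units_needed(peak, capacity)     # 2 units for the load + 2 spare = 4 units
print(n, per_unit_load(peak, n, 2))  # with 2 units out, each survivor serves 600 QPS
```

With only 3 units, losing 2 would push the full 1200 QPS onto a single 1000 QPS instance, which is exactly the overload scenario N+2 sizing is meant to prevent.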
Correlated failures
It is also important to be aware of correlated failures. These occur when related items fail at the same time.
Consideration
- If a single machine fails, all requests served by that machine fail.
- If a top-of-rack switch fails, the entire rack fails.
- If a zone or region is lost, all the resources in it fail.
- Servers running the same software run into the same issues.
- If a global configuration system fails, and multiple systems depend on it, they potentially fail too.
Example
At the simplest level, if a single machine fails, all requests served by that machine fail. At a hardware level, if a top-of-rack switch fails, the entire rack fails. At the cloud level, if a zone or region is lost, all the resources in it become unavailable. Servers running the same software suffer from the same issue: if there is a fault in the software, the servers may fail at around the same time.
Correlated failures can also apply to configuration data. If a global configuration system fails, and multiple systems depend on it, they potentially fail too. When we have a group of related items that could fail together, we refer to it as a failure or fault domain.
Avoid correlated failures
Several techniques can be used to avoid correlated failures. It is useful to be aware of failure domains; then servers can be decoupled using microservices distributed among multiple failure domains. To achieve this, you can divide business logic into services based on failure domains and deploy to multiple zones and/or regions.
- Decouple servers and use microservices distributed among multiple failure domains.
- Divide business logic into services based on failure domains.
- Deploy to multiple zones and/or regions.
- Split responsibility into components and spread over multiple processes.
- Design independent, loosely coupled but collaborating services.
At a finer level of granularity, it is good to split responsibilities into components and spread these over multiple processes. This way a failure in one component will not affect other components. If all responsibilities are in one component, a failure of one responsibility has a high likelihood of causing all responsibilities to fail.
When you design microservices, your design should result in loosely coupled, independent but collaborating services. A failure in one service should not cause a failure in another service. It may cause a collaborating service to have reduced capacity or not be able to fully process its workflows, but the collaborating service remains in control and does not fail.
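As a rough sketch of this idea, replicas of a service can be spread across failure domains so that no single zone holds all of them. The zone names, service name, and replica count below are illustrative only:

```python
# A minimal sketch of spreading service replicas across failure domains (zones).
# Zone names, service name, and replica count are illustrative assumptions.
from itertools import cycle

def spread_replicas(service: str, replicas: int, zones: list[str]) -> dict[str, str]:
    """Assign replicas round-robin across zones so no single zone holds them all."""
    zone_cycle = cycle(zones)
    return {f"{service}-{i}": next(zone_cycle) for i in range(replicas)}

placement = spread_replicas("checkout", 6, ["us-central1-a", "us-central1-b", "us-central1-c"])
for replica, zone in placement.items():
    print(replica, "->", zone)  # losing one zone takes out only a third of the replicas
```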
Cascading failures
Cascading failures occur when one system fails, causing others to be overloaded, such as a message queue becoming overloaded because of a failing backend.
Example
Cascading failures occur when one system fails, causing others to be overloaded and subsequently fail. For example, a message queue could be overloaded because a backend fails and it cannot process messages placed on the queue.
The graphic on the left shows a Cloud Load Balancer distributing load across two backend servers. Each server can handle a maximum of 1000 queries per second. The load balancer is currently sending 600 queries per second to each instance. If server B now fails, all 1200 queries per second have to be sent to just server A, as shown on the right. This is much higher than the specified maximum and could lead to cascading failure.
Avoid cascading failures
Cascading failures can be handled with support from the deployment platform. For example, you can use health checks in Compute Engine or readiness and liveness probes in GKE to enable the detection and repair of unhealthy instances. You want to ensure that new instances start fast and ideally do not rely on other backends/systems to start up before they are ready.
- Use health checks in Compute Engine or readiness and liveness probes in Kubernetes to detect and then repair unhealthy instances.
- Ensure that new server instances start fast and ideally don’t rely on other backends/systems to start up.
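As a minimal sketch of the application side, a service can expose a lightweight health endpoint for the platform's health check or Kubernetes probes to poll. The /healthz path and port 8080 are common conventions used here as assumptions, not requirements:

```python
# A minimal health-check endpoint sketch using only the Python standard library.
# Point your Compute Engine health check or Kubernetes probe at the path and
# port you configure here; /healthz and 8080 are just conventions.
from http.server import BaseHTTPRequestHandler, HTTPServer

def dependencies_ok() -> bool:
    """Placeholder for cheap local checks (e.g. worker threads alive).
    Avoid deep checks of other backends here, so that a dependency outage
    does not needlessly take this instance out of rotation."""
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz" and dependencies_ok():
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(503)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```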
Query of death overload
You also want to plan against a "query of death", where a request made to a service causes a failure in the service. It is called the query of death because the error manifests itself as overconsumption of resources, when in reality it is due to an error in the business logic itself.
Problem
Business logic error shows up as overconsumption of resources, and the service overloads. This ‘query of death’ is any request to your system that can cause it to crash. A client may send a query of death, crash one instance of your service, and keep retrying, bringing further instances down.
Solution
Monitor query performance. Ensure that notification of these issues gets back to the developers.
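A minimal sketch of such monitoring, assuming a hypothetical handle_request function and an arbitrary latency threshold: wrap request handling, log crashes and unusually expensive requests, and route those logs to the developers:

```python
# A minimal sketch of per-request monitoring to help spot a "query of death".
# handle_request() and the threshold below are hypothetical placeholders.
import logging
import time

logging.basicConfig(level=logging.WARNING)
SLOW_SECONDS = 5.0  # hypothetical "suspiciously expensive" threshold

def handle_request(request: dict) -> str:
    return "ok"  # placeholder for real business logic

def monitored(request: dict) -> str:
    start = time.monotonic()
    try:
        return handle_request(request)
    except Exception:
        # Surface the exact request that crashed the handler to the developers.
        logging.exception("request crashed: %r", request)
        raise
    finally:
        elapsed = time.monotonic() - start
        if elapsed > SLOW_SECONDS:
            logging.warning("slow request (%.1fs): %r", elapsed, request)
```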
Positive feedback cycle overload
You should also plan against positive feedback cycle overload failure, where a problem is caused by trying to prevent problems.
Example
This happens when you try to make the system more reliable by adding retries in the event of a failure. Instead of fixing the failure, this creates the potential for overload. You may actually be adding more load to an already overloaded system.
Avoid positive feedback overload
Implement correct retry logic.
Truncated exponential backoff pattern
Consideration
If a service invocation fails, try again:
- Continue to retry, but wait a while between attempts.
- Wait a little longer each time the request fails.
- Set a maximum length of time and a maximum number of requests.
- Eventually, give up.
Example
- Request fails; wait 1 second + random_number_milliseconds and retry.
- Request fails; wait 2 seconds + random_number_milliseconds and retry.
- Request fails; wait 4 seconds + random_number_milliseconds and retry.
- And so on, up to a maximum_backoff time.
- Continue waiting and retrying up to some maximum number of retries.
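A minimal sketch of this retry loop, assuming a hypothetical call_service function that raises on failure and illustrative backoff limits:

```python
# A minimal truncated exponential backoff sketch. call_service() is a
# hypothetical placeholder; the retry and backoff limits are typical values,
# not mandated ones.
import random
import time

MAX_RETRIES = 5
MAX_BACKOFF_SECONDS = 32.0

def call_service():
    raise ConnectionError("placeholder: replace with a real service call")

def call_with_backoff():
    for attempt in range(MAX_RETRIES):
        try:
            return call_service()
        except ConnectionError:
            if attempt == MAX_RETRIES - 1:
                raise  # eventually, give up
            # Wait 1s, 2s, 4s, ... plus random jitter, capped at MAX_BACKOFF_SECONDS.
            backoff = min(2 ** attempt + random.random(), MAX_BACKOFF_SECONDS)
            time.sleep(backoff)
```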
Circuit breaker pattern
We covered the implementation of this pattern for microservices and Kubernetes infrastructure in one of our previous articles.
The circuit breaker pattern can also protect a service from too many retries. The pattern implements a solution for when a service is in a degraded state of operation. It is important because if a service is down or overloaded and all its clients are retrying, the extra requests actually make matters worse. The circuit breaker design pattern protects the service behind a proxy that monitors the service health. If the service is not deemed healthy by the circuit breaker, it will not forward requests to the service. When the service becomes operational again, the circuit breaker will begin feeding requests to it again in a controlled manner.
Consideration
- Plan for degraded state operations.
- If a service is down and all its clients are retrying, the increasing number of requests can make matters worse.
- Protect the service behind a proxy that monitors service health (the circuit breaker).
- If the service is not healthy, don’t forward requests to it.
- If using GKE, leverage Istio to automatically implement circuit breakers.
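Outside of Istio, the same idea can be sketched in application code. The class below is a simplified illustration, not a production implementation; the thresholds and the protected call are assumptions:

```python
# A simplified circuit breaker sketch. The thresholds are illustrative; the
# protected callable is whatever remote call you want to guard.
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (service healthy)

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("service marked unhealthy; not forwarding request")
            self.opened_at = None  # half-open: let one request probe the service
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

While the circuit is open, callers get CircuitOpenError immediately instead of piling more requests onto a struggling service; once the reset timeout passes, traffic is fed back in a controlled manner.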
We covered how to deploy applications for high availability, durability, and scalability. We described the most common design patterns for avoiding single points of failure, correlated failures, and cascading failures, and how to implement correct retry logic.
Please visit our #CyberTechTalk WIKI pages for much more information about designing reliable systems, monitoring, and information security:
– Reliability-as-a-Service
– Monitoring and Observability
– Cloud adoption
– Business Continuity and Disaster Recovery
– Incident Management
– Release Management
– Security
Contact Us for FREE evaluation.
Be ethical, save your privacy!