Site reliability engineering (SRE) empowers software developers to own the ongoing daily operation of their applications in production. The goal is to bridge the gap between the development team that needs to ship continuously and the operations team that’s responsible for the reliability of the production environment. Site reliability engineering shifts the responsibility of production reliability to the SRE on the development team.
Site reliability engineers typically spend up to 50% of their time on the daily tasks that keep the application reliable and the rest of their time developing software.
A key skill of a site reliability engineer is a deep understanding of the application. This includes knowledge of the code, how the application runs, how it’s configured, and how it scales. Some of the typical responsibilities of a site reliability engineer are to:
- Define Service Level Objectives (SLOs), a foundational concept for SRE. This means disentangling indicators from objectives from agreements and finding useful metrics for your own applications.
- Eliminate toil, one of SRE’s most important tasks. Toil is mundane, repetitive operational work that provides no enduring value and scales linearly with service growth.
- Monitor the service, an essential part of doing the right thing in production. If you can’t monitor a service, you don’t know what’s happening, and if you’re blind to what’s happening, you can’t be reliable. A good reliability engineer proactively monitors and reviews application performance and ensures that the software has good logging and diagnostics.
- Handle on-call and emergency support and help triage escalated support tickets.
- Create and maintain operational runbooks. Prioritize automation over manual work.
- Contribute to the overall product roadmap. Perform live site reviews and capture feedback for system outages. Conduct a blameless postmortem after each incident.
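To make the SLO responsibility concrete, here is a minimal sketch of how an availability objective can be turned into an error budget. The error budget is a common companion concept to SLOs rather than something defined above, and the function name, the 99.9% target, and the request counts are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float      # observed fraction of successful requests
    budget_total: float      # failed requests the SLO tolerates in the window
    budget_remaining: float  # failures still allowed before the SLO is missed


def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> SloReport:
    """Compare observed failures against the budget implied by an SLO target."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    availability = (total_requests - failed_requests) / total_requests
    budget_total = (1 - slo_target) * total_requests
    return SloReport(availability, budget_total, budget_total - failed_requests)


# Hypothetical numbers: a 99.9% target over one million requests.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"availability={report.availability:.4%}, "
      f"remaining budget={report.budget_remaining:.0f} requests")
```

A negative remaining budget means the objective has already been missed for the window, which is usually a signal to favor reliability work over new releases.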
Site reliability engineering versus DevOps
Both SRE and DevOps are methodologies that address an organization’s need for a way to manage the production environment. In our earlier post, we have already listed the most essential DevOps skills.
Development and implementation
DevOps is about core development. SRE is about implementing the core. What does that mean?
DevOps teams are focused on core development. They are working on a product or application that is the solution to someone’s problem. They are taking an agile approach to software development that helps them build, test, deploy and monitor applications with speed, quality and control.
SREs are working on the implementation of the core. They are constantly giving feedback to that core development group to say “Hey, something that you guys have designed isn’t working exactly the way that you think that it is.” SRE leverages operations data and software engineering to automate IT operations tasks and accelerate software delivery, while minimizing IT risk.
Skills
DevOps and SRE call for different skill sets. Core development DevOps are the guys that love writing software. They are writing code, testing it, and pushing it out into production to get an application online to help solve a problem.
SREs are more investigative. They are willing to do the analysis to find why something has gone wrong. They want to ensure that the same problems don’t keep happening. They want to be proactive in their efforts, not reactive. They want to automate repetitive tasks so they can innovate.
Automation
Sometimes, there just isn’t enough time to do everything manually, regardless of your role. Sometimes you need to find ways to automate things so that you can focus your time and energy on innovation. You don’t have to automate everything, but if you are constantly doing the same task over and over, why not use automation to reduce the toil? Automation is key.
DevOps is going to automate deployment. They’re going to automate tasks and features. SRE is going to automate redundancy, and they’re going to automate manual tasks that they can turn into programmatic tasks to keep the stack up and running.
SRE skills
The types of skills needed vary depending on the application, how and where it’s deployed, and how it’s monitored. For example, organizations that use serverless technologies won’t need someone with in-depth knowledge of Windows or Linux systems management. However, these skills are critical to teams that use servers for deployments.
Other key skills for a good SRE focus on application monitoring and diagnostics. An SRE should have experience with application performance management tools like Application Insights. They should also understand application logging best practices and exception handling.
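As a rough illustration of logging and exception handling working together, the sketch below uses Python’s standard `logging` module. The logger name `orders` and the `parse_quantity` function are hypothetical, not taken from any particular application.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("orders")  # hypothetical service name


def parse_quantity(raw: str) -> int:
    """Parse a quantity field, logging enough context to diagnose failures."""
    try:
        quantity = int(raw)
    except ValueError:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("invalid quantity raw=%r", raw)
        raise  # re-raise so the caller can decide how to respond
    if quantity <= 0:
        logger.warning("rejected non-positive quantity=%d", quantity)
        raise ValueError(f"quantity must be positive, got {quantity}")
    logger.info("parsed quantity=%d", quantity)
    return quantity
```

The point of the sketch is that the error is logged once, with the offending input, and then propagated rather than swallowed, so diagnostics and control flow stay separate.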
Monitoring and version control tools
Version control tools are complementary to coding. Git, together with a hosting platform such as GitHub, is the main version control tool to master. Site reliability engineers also benefit from knowing how to use monitoring tools like Grafana, Loki and Prometheus. Monitoring tools give you a quick summary of the system’s efficiency and performance. Site reliability engineers often have prior expertise in implementing both version control tools and monitoring tools.
Containers and orchestration
A good reliability engineer must understand how Docker containers work and how Kubernetes clusters perform. This includes handling Docker images, managing a Docker registry, and creating your own Docker image with a Dockerfile. An SRE should master the main Kubernetes objects: namespaces, pods, services, endpoints, and Ingress.
Tools like Helm, Kustomize, Flux, and ArgoCD are an essential part of the daily work of any SRE.
Coding languages
Even if they don’t have to code a website from beginning to end, proficiency in one or two of the main coding languages, like Java, .NET, Python or PowerShell, can be valuable. Knowing how to code helps site reliability engineers edit websites to promote reliability and improve the user experience. The more languages you know, the more marketable you may be as a candidate.
Release management and CI/CD pipeline
CI/CD is an abbreviation for continuous integration/continuous delivery (or continuous deployment). CI/CD pipelines connect development teams and activities to their operational counterparts. This technical skill demonstrates your ability to adapt and shows you can bridge the needs of the user base with management goals.
Cloud and distributed computing
Having experience with distributed computing systems could distinguish you as a site reliability engineer. Distributed computing systems depend on devices that are connected to and communicate with each other over a shared network. Many employers may want you to know how to handle these complex systems so you can ensure the sharing of information between the company’s network systems or employees.
SREs need cloud skills because around 90% of businesses use the cloud, and you can’t manage reliability for cloud environments very well if you don’t understand cloud architectures, cloud networking, cloud data storage, cloud observability and so on.
Networking
The network plays a pivotal role in connecting modern, distributed environments, and a network fault can surface as an application incident. This is why SREs should master networking concepts. Even if their organization also employs networking engineers, site reliability engineers need a deep understanding of networking themselves to know when the network is the root cause of an incident and how to resolve network-caused issues effectively.
Security
Security is another domain that SREs don’t “own,” but where they nonetheless require significant skills. Indeed, good reliability engineering makes security a priority, and vice versa. SREs who don’t understand security fundamentals are at risk of implementing solutions that are effective from a reliability standpoint, but not necessarily secure.
Operating systems, virtualization and databases
Site reliability engineers spend a lot of time working on various operating systems and databases. Knowing the characteristics of common systems is a marketable skill, as is database management. Data models are an integral part of their work, so it makes sense for you to be familiar with relational and non-relational databases.
Automation skills
Automation skills reduce the need for manual work, which improves efficiency. Site reliability engineers can automate repetitive tasks and give themselves more time to do high-value work that requires intellectual input. Potential employers may prize site reliability engineers who have strong automation skills and can save companies time and money.
Precise communication
Site reliability engineers navigate between the technical and business worlds. Fluency in the correct terminology is important for conveying concepts directly and precisely. For example, if a customer requests certain features and functionality, the site reliability engineer has to translate that request into technical know-how before implementing it. Site reliability engineers also communicate with fellow team members, and explaining an issue clearly from the outset can help ensure a quick resolution.
Incident management
Perhaps the single most important type of skill for SREs to learn is incident management. Although many roles may participate in incident response, SREs usually take the lead in organizing the incident response team, communicating with stakeholders and devising the best strategy for resolving each incident as quickly as possible.
In addition to overseeing incident response, SREs may be tasked with managing postmortems. Knowing how to run a postmortem, as well as when a postmortem is necessary and when it makes sense to use a “blameless” postmortem approach, is an essential SRE skill.
Problem-solving
Troubleshooting is one of a site reliability engineer’s major responsibilities. Being a successful troubleshooter is about having strong analytical skills that enable you to recognize the problem, devise a solution, and execute that solution. Creative site reliability engineers can find unique and innovative ways to solve problems.
Monitoring and alerting
A key principle for any effective reliability engineer is good, reactive monitoring and alerting. Monitoring and alerting enable a system to tell people when it’s broken, or perhaps to tell them what’s about to break. If someone needs to investigate the problem, the alert should give relevant information so the person knows where to start.
When you review existing alerts or write new alert rules, consider these guidelines to keep your alerts relevant and your on-call rotation happier:
- Alerts that trigger a human’s attention should be urgent, important, action-oriented, and real.
- Alerts should represent either ongoing or imminent problems with your service.
- Remove noisy alerts. Over-monitoring is a harder problem to solve than under-monitoring.
- Classify the problem into one of these categories:
  - Availability and basic functionality.
  - Latency.
  - Correctness.
  - Feature-specific problems.
- Symptoms are a better way to capture problems comprehensively and robustly with less effort.
- Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.
- The further up your serving stack you go, the more distinct problems you catch in a single rule. But don’t go so far that you can’t sufficiently distinguish what’s going on.
- If you want a quiet on-call rotation, have a system for dealing with issues that need a timely response but aren’t imminently critical.
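The guidelines above can be sketched as a small routing function that decides whether an alert should page a human, open a ticket, or only be logged. The `Alert` fields and the three routes are hypothetical names chosen for this example, not part of any alerting product.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"      # interrupt a human now
    TICKET = "ticket"  # timely response needed, but not imminently critical
    LOG = "log"        # keep for diagnostics only


@dataclass(frozen=True)
class Alert:
    name: str
    real_problem: bool  # ongoing or imminent problem with the service
    actionable: bool    # a human can do something about it
    urgent: bool        # needs a response right away


def route(alert: Alert) -> Route:
    """Page only for real, actionable, urgent alerts; ticket the non-urgent ones."""
    if not (alert.real_problem and alert.actionable):
        return Route.LOG
    return Route.PAGE if alert.urgent else Route.TICKET


print(route(Alert("checkout error rate high", True, True, True)))
print(route(Alert("disk 70% full", True, True, False)))
print(route(Alert("single pod restarted", False, False, False)))
```

Keeping a separate ticket route is one way to give non-critical issues a timely response without adding noise to the on-call rotation.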
Symptom-based monitoring versus cause-based monitoring
The “what’s broken” indicates the symptom; the “why” indicates a (possibly intermediate) cause. The table below lists some hypothetical symptoms and corresponding causes. From ref.
| Symptom | Cause |
| --- | --- |
| I’m serving HTTP 500s or 404s | Database servers are refusing connections |
| My responses are slow | CPUs are overloaded, or an Ethernet cable is crimped under a rack, visible as partial packet loss |
| Users in Antarctica aren’t receiving animated GIFs | Your Content Distribution Network hates scientists and felines, and thus blacklisted some client IPs |
| Private content is world-readable | A new software push caused ACLs to be forgotten and allowed all requests |
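One way to read the table is that the alert condition should be the symptom, while cause signals only enrich the message. The sketch below illustrates that idea. The function name, the 1% error threshold, and the cause signal names are assumptions made for the example.

```python
from typing import Dict, Optional


def check_http_errors(
    total: int,
    server_errors: int,
    cause_signals: Dict[str, bool],
    max_error_ratio: float = 0.01,
) -> Optional[str]:
    """Return a page message when the symptom fires, otherwise None."""
    if total == 0:
        return None
    ratio = server_errors / total
    if ratio <= max_error_ratio:
        # No user-visible symptom: stay quiet even if a cause signal is active.
        return None
    suspects = sorted(name for name, active in cause_signals.items() if active)
    hint = ", ".join(suspects) if suspects else "none detected"
    return (
        f"HTTP 5xx ratio {ratio:.1%} exceeds {max_error_ratio:.1%}; "
        f"possible causes: {hint}"
    )


signals = {"db_refusing_connections": True, "cpu_overloaded": False}
print(check_http_errors(total=1000, server_errors=50, cause_signals=signals))
print(check_http_errors(total=1000, server_errors=5, cause_signals=signals))
```

In the second call the database signal is still active, but no page is produced because users are not seeing an elevated error rate.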
Establish a baseline
An effective alerting strategy starts with establishing a baseline. The baseline statistics and constraints serve as a standard against which data drift and other data quality issues can be detected. I propose the following sequence of steps:
- Start small.
- Under normal conditions, monitor CPU, memory size, and I/O to establish a baseline that can be used to detect performance issues early.
- Investigate the measured values to understand the pattern over time.
- Once the pattern is understood, use it to establish a baseline value for resource usage.
- Use the baseline value to set a usage threshold and configure a reactive monitoring rule (alert).
- Once an alert is received, investigate the growth in usage before attempting to remedy the condition.
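The steps above can be sketched in a few lines. This example assumes the baseline is summarized as the mean of samples collected under normal conditions and that the threshold sits three standard deviations above it. Both choices, along with the sample CPU readings, are illustrative assumptions rather than a recommendation from the text.

```python
from statistics import mean, stdev
from typing import List, Sequence


def baseline_threshold(samples: Sequence[float], sigmas: float = 3.0) -> float:
    """Derive an alert threshold as mean + sigmas * sample standard deviation."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    return mean(samples) + sigmas * stdev(samples)


def breaches(values: Sequence[float], threshold: float) -> List[float]:
    """Return the measurements that exceed the threshold."""
    return [v for v in values if v > threshold]


# Hypothetical CPU utilization (%) collected under normal conditions.
cpu_normal = [41.0, 43.5, 39.8, 42.2, 40.5, 44.1, 41.7]
threshold = baseline_threshold(cpu_normal)
print(f"alert when CPU exceeds {threshold:.1f}%")
print(breaches([45.0, 52.3], threshold))
```

A fixed multiple of the standard deviation is only a starting point. Workloads with daily or weekly cycles usually need a baseline per time window instead of a single value.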