Earlier I tried to describe who a system reliability engineer really is and what they are responsible for. Today, I would like to focus on the set of skills and approaches any SRE needs to follow.
To understand how your system really behaves, how reliable it is, and what the best way to adopt common SRE practices is, I will list the top 10 most important practices that any reliability engineer should have on their checklist.
1. Incident Management
- There must be a well-defined process for how incidents are managed.
- Where are incidents reported/raised?
- Who monitors the alerts at any given point in time?
- Is there any team that gets the alerts before the SRE team and tries to handle the issue?
- Is the process automated?
- Do you need to manually open a ticket?
- Do you need to go to an incident platform/page, or do you get an alert?
- Do you schedule a meeting to talk about
- Create an incident management flow. Be sure everyone is familiar with it.
- Establish an incident management and response team.
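One simple way to answer "who monitors the alerts at any given point in time" is a fixed rotation. The snippet below is only a minimal sketch, assuming equal-length shifts that start on a known date; the function name `on_call_for`, the engineer names, and the start date are all hypothetical.

```python
from datetime import date


def on_call_for(day: date, rotation: list[str], start: date, shift_days: int = 7) -> str:
    """Return who is on call on `day` for a simple round-robin rotation."""
    if not rotation:
        raise ValueError("rotation must contain at least one engineer")
    if day < start:
        raise ValueError("day is before the rotation start date")
    # Number of complete shifts that have passed since the rotation began.
    shifts_elapsed = (day - start).days // shift_days
    return rotation[shifts_elapsed % len(rotation)]


# Hypothetical team and start date, for illustration only.
rotation = ["alice", "bob", "carol"]
rotation_start = date(2024, 1, 1)
print(on_call_for(date(2024, 1, 10), rotation, rotation_start))  # bob
```

Real on-call tools also handle overrides, holidays, and escalation, which this sketch deliberately leaves out.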
2. Define SLO (Service Level Objective)
- Describe golden rules and define the most valuable indicators (SLI)
- Define critical objectives for indicators (SLO)
- There are no identical systems. Try to understand your system's architecture first.
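To make the relationship between an indicator and its objective concrete, here is a minimal sketch of an availability check. It assumes the SLI is the share of successful requests over some window; the function name `error_budget` and the numbers are illustrative, not taken from a real system.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compare a measured availability SLI against an SLO target."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # SLI: fraction of requests that succeeded in the window.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: how many failures the SLO tolerates in the window.
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_remaining": allowed_failures - failed_requests,
    }


# Illustrative numbers: a 99.9% target over one million requests.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(report)
```

With these inputs the SLO is met and roughly 600 failures remain in the budget, which the team can spend on releases or experiments.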
3. Provisioning
- How do you provision the infrastructure required for deploying the application? (Terraform, Pulumi, CloudFormation, …)
- How do you install the application and its dependencies? (Container, Bash Script, Ansible, …)
- How do you deploy the application or the application service? (k8s, cloud instances)
- How do you configure the application? (k8s, Ansible, …)
4. Resiliency and disaster recovery
- Is there a single point of failure?
- Your app is able to withstand outages (usually by implementing multi-region or multi-cloud architectures)
- Your app scales up and down in response to load changes
- Resources are deployed across availability zones, regions, etc.
- Is there a disaster recovery (DR) plan?
- Does your team exercise DR regularly?
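A quick way to spot one kind of single point of failure is to check whether every service runs in more than one availability zone. This is a minimal sketch, assuming you can export your inventory as (service, zone) pairs; the function name `single_zone_services` and the sample inventory are hypothetical.

```python
from collections import defaultdict


def single_zone_services(instances: list[tuple[str, str]]) -> list[str]:
    """Return services whose instances all live in one availability zone."""
    zones_by_service = defaultdict(set)
    for service, zone in instances:
        zones_by_service[service].add(zone)
    return sorted(s for s, zones in zones_by_service.items() if len(zones) < 2)


# Hypothetical inventory: "db" runs two instances, but both in zone-a.
inventory = [
    ("web", "zone-a"),
    ("web", "zone-b"),
    ("db", "zone-a"),
    ("db", "zone-a"),
]
print(single_zone_services(inventory))  # ['db']
```

Zone spread is only one dimension; shared dependencies such as a single DNS provider or one database primary need their own checks.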
5. GitOps
- Have you adopted GitOps?
- How do you monitor Flux reconciliation?
- Do you use ArgoCD dashboards and notifications?
- Use Kustomize and Helm for manifest deployment.
- Auto Prune (resources are deleted when files/content are deleted)
- Self-heal (cluster state is corrected based on Git state when manual changes are made to the cluster)
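The ideas behind auto prune and self-heal can be illustrated with a small reconciliation plan. This is a conceptual sketch only and not how Flux or ArgoCD are implemented; it assumes the Git state and the cluster state can each be represented as a dictionary of resource name to spec, and `plan_reconcile` is a hypothetical helper.

```python
def plan_reconcile(desired: dict, live: dict, prune: bool = True) -> dict:
    """Return the actions needed to bring `live` in line with `desired`."""
    # Resources in Git that are missing from the cluster.
    create = sorted(name for name in desired if name not in live)
    # Self-heal: resources that drifted from what Git declares.
    update = sorted(
        name for name in desired if name in live and desired[name] != live[name]
    )
    # Auto prune: resources in the cluster that no longer exist in Git.
    delete = sorted(name for name in live if name not in desired) if prune else []
    return {"create": create, "update": update, "delete": delete}


# Hypothetical states: someone scaled "deployment/web" by hand, and
# "configmap/old" was removed from Git but still exists in the cluster.
git_state = {"deployment/web": {"replicas": 3}, "service/web": {"port": 80}}
cluster_state = {"deployment/web": {"replicas": 5}, "configmap/old": {"k": "v"}}

plan = plan_reconcile(git_state, cluster_state)
print(plan)
```

Here the plan creates `service/web`, resets `deployment/web` to the Git-declared spec, and deletes `configmap/old`; with `prune=False` the delete list stays empty.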
6. Monitoring and alerting
- Choose a monitoring solution: Prometheus/Grafana/Loki or the ELK stack
- If you can afford it, consider ready-made monitoring solutions like DataDog, NewRelic, …
- Be aware of maintenance and how much time you are willing to invest in developing and maintaining a monitoring solution
- Be sure that critical alerts are defined
- Make sure the team gets notifications on critical alerts (Slack, phone, email, … perhaps all; whatever works best for the team)
- Implement reactive monitoring – alerting
- Consider automatic ticket creation in ITSM or triggering a runbook on a critical alert
- Continuously improve your monitoring. Establish a baseline and observe deviations.
- Use dashboards
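"Establish a baseline and observe deviations" can be as simple as flagging samples that sit far from the historical mean. The following is a minimal sketch, assuming a roughly stable metric and a z-score threshold of 3; the function name `find_deviations` and the latency values are made up for illustration.

```python
from statistics import mean, stdev


def find_deviations(baseline: list[float], recent: list[float], threshold: float = 3.0) -> list[float]:
    """Return recent samples further than `threshold` std devs from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any different value is a deviation.
        return [x for x in recent if x != mu]
    return [x for x in recent if abs(x - mu) / sigma > threshold]


# Made-up latency samples in milliseconds.
baseline_latency_ms = [100, 102, 98, 101, 99, 100, 103, 97]
recent_latency_ms = [101, 99, 180, 102]

anomalies = find_deviations(baseline_latency_ms, recent_latency_ms)
print(anomalies)  # [180]
```

A static z-score ignores seasonality and trends, so treat it as a starting point before reaching for the anomaly detection built into your monitoring stack.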
7. IaC and CI/CD
- Choose one solution if possible.
- Follow the DRY (Don’t Repeat Yourself) principle: make sure there is no code duplication
- Readable code – use naming conventions, formatting
- Use pull requests for infrastructure. Treat infrastructure code like your application code
- Consider adding cost checks (e.g. test whether a change will raise the bill significantly if you are using a public cloud)
- Make changes ONLY through a new commit. No manual interaction.
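The cost check mentioned above could run as a pipeline step that fails when a change raises the estimated bill too much. This is a minimal sketch, assuming your tooling already produces monthly cost estimates for the current and proposed infrastructure; the function name `cost_gate` and the 10% limit are hypothetical choices.

```python
def cost_gate(current_monthly: float, proposed_monthly: float, max_increase_pct: float = 10.0) -> bool:
    """Return True if the proposed change stays within the allowed cost increase."""
    if current_monthly <= 0:
        raise ValueError("current_monthly must be positive")
    increase_pct = (proposed_monthly - current_monthly) / current_monthly * 100
    return increase_pct <= max_increase_pct


# Illustrative estimates in the same currency unit.
print(cost_gate(1000.0, 1080.0))  # True: an 8% increase is within the 10% limit
print(cost_gate(1000.0, 1250.0))  # False: a 25% increase should block the merge
```

In a pull request workflow the `False` case would fail the check and ask for an explicit approval rather than silently merging.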
8. Security
- Do not store credentials in plain text
- Use the principle of least privilege
- Encrypt your data at rest and in transit
- Use service accounts to connect to cloud resources
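As a small illustration of keeping credentials out of plain-text files, the sketch below reads a secret from the environment and masks it before logging. It assumes the secret is injected by your platform (for example by a secret manager or the CI system); `load_secret`, `mask`, and the variable name `DB_PASSWORD` are hypothetical.

```python
import os


def load_secret(name: str) -> str:
    """Read a secret from the environment instead of a file in the repository."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"secret {name!r} is not set in the environment")
    return value


def mask(secret: str, visible: int = 2) -> str:
    """Return a log-safe form that reveals only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


if __name__ == "__main__":
    # DB_PASSWORD is a hypothetical variable name injected by the platform.
    try:
        password = load_secret("DB_PASSWORD")
        print("loaded secret:", mask(password))
    except RuntimeError as err:
        print(err)
```

Failing fast when the secret is missing is a deliberate choice here: it is easier to debug than a service that starts with an empty password.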
9. Leadership
- Set goals. Define SLO/SLI/SLA
- Set up KPIs and conduct regular meetings to monitor them.
- 100% reliability is not a good goal!
- Is there an onboarding page for SREs joining the team?
- Schedule 1:1 meetings with the team (probably…manager or lead?)
- Identify possible gaps.
- Eliminate toil
- Does the development team wait on SRE for infra-related operations?
- Identify the SRE team's maturity and work on improving it
- Step 1: Operations: SRE is focused on resolving issues, dealing with requests
- Step 2: Automation: SRE is moving towards automation and self-service. Providing tooling, documentation, etc.
- Step 3: Product: SRE is focused on improving the product itself – reliability, performance, etc.
- Keep learning ALL THE TIME
10. Post incident analysis
- Hold a root cause analysis meeting after each incident
- Note everything
- Promote postmortem analysis and a blameless culture
Many thanks to Arie Bregman for the awesome checklist. Many thanks to everyone who commented on LinkedIn. It really helps us grow and create more interesting content.
For more SRE content, please subscribe to our newsletter, follow us on Twitter, and check out our SRE posts if you have not done so yet.
Save your privacy, be ethical!