To consistently address issues raised through your ITSM process, you need to monitor, report on, and review how quickly you respond. Mean Time To Identify (MTTI) and Mean Time To Resolution (MTTR) are key indicators that provide visibility into performance and point to areas for improvement.
What is MTTR?
MTTR is the average time it takes to resolve an incident end to end, while MTTI is the average time it takes to recognize issues in service or component performance.
MTTR can be divided into 4 component intervals:
- Mean Time To Identify (MTTI): Time period between the start of an incident and the time the incident is detected. The detection may be automatic via events/alarms seen in the event management system.
- Mean Time To Know (MTTK): Time period between the detection of the incident and the time the root cause of the incident is identified.
- Mean Time To Fix (MTTF): Time period between the isolation of the root cause and the time the issue is resolved.
- Mean Time To Verify (MTTV): Time period between the resolution of the issue and confirmation of successful resolution from the users or automated tests.
In short, MTTR is the sum of MTTI, MTTK, MTTF and MTTV: MTTR = MTTI + MTTK + MTTF + MTTV (from IBM).
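For example, if detection takes 10 minutes, root-cause analysis 30 minutes, the fix 45 minutes, and verification 5 minutes, then MTTR = 10 + 30 + 45 + 5 = 90 minutes for that incident (the numbers are purely illustrative).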
Two common ways to reduce MTTR are:
- UI Consolidation: make information available with the fewest mouse clicks possible, so users do not lose time looking it up across various tools. I prefer using Grafana, as it has out-of-the-box integration with a variety of data sources like Prometheus, Loki, and Jaeger/Tempo in one single dashboard (see the provisioning sketch after this list).
- Tools Consolidation: reduce the number of similar tools. Consider a single pane of glass on top of your existing infrastructure tools rather than trying to replace them all. Grafana or Azure Application Insights can correlate metrics, logs, and traces from the full stack in one place.
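As a starting point for UI consolidation, Grafana dashboards can themselves be provisioned from files, just like the data sources shown later in this post. Below is a minimal sketch; the provider name, folder, and dashboard path are assumptions you would adapt to your own setup.

apiVersion: 1
providers:
  - name: 'sre-dashboards'             # hypothetical provider name
    orgId: 1
    folder: 'SRE'                      # Grafana folder the dashboards land in
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30          # how often Grafana re-scans the path
    options:
      path: /var/lib/grafana/dashboards   # assumed location of dashboard JSON files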
Observability pillars
Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
The Three Pillars of Observability are:
- Metrics (something is happening) are numerical measurements with attributes that indicate the health of one aspect of a system. In many scenarios, metric collection is intuitive: CPU, memory, and disk utilization are natural metrics associated with the health of a system.
- Logs (what is happening) contain granular details of an application’s request processing stages. Exceptions in logs can provide indicators of problems in an application. Monitoring errors and exceptions in logs is an integral part of an observability solution. Parsing logs can also provide insights into application performance.
- Traces (where is it happening) tell the story of a request as it propagates through a distributed system. Because distributed tracing connects every step of a transaction, it lets you see what is happening to every service component and app in production.
The pillars are closely interconnected and work best when tightly correlated with one another (as illustrated by Grafana Labs).
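To make that interconnection concrete, all three signals are often ingested through a single collection pipeline before landing in dedicated backends. The sketch below uses the OpenTelemetry Collector (contrib distribution) purely as an illustration; the endpoints are assumptions and the available exporter names vary between collector versions, so treat it as a rough outline rather than a drop-in config.

# otel-collector-config.yaml -- one pipeline per pillar (illustrative only)
receivers:
  otlp:                          # applications send metrics, logs and traces over OTLP
    protocols:
      grpc:
processors:
  batch:
exporters:
  prometheusremotewrite:         # metrics -> Prometheus (assumes its remote-write receiver is enabled)
    endpoint: http://localhost:9090/api/v1/write
  loki:                          # logs -> Loki (exporter availability depends on collector version)
    endpoint: http://localhost:3100/loki/api/v1/push
  otlp/jaeger:                   # traces -> Jaeger via OTLP
    endpoint: localhost:4317
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]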
Grafana-Prometheus-Loki-Jaeger stack
Grafana comes with built-in support for many data sources. If you need other data sources, you can also install one of the many data source plugins. If the plugin you need doesn’t exist, you can develop a custom plugin.
Prometheus is an open-source time series database that scrapes metrics from instrumented targets and stores them for monitoring and alerting. The Prometheus data source also works with other projects that implement the Prometheus querying API, such as Grafana Mimir or Thanos.
You can define and configure the data source in YAML files as part of Grafana’s provisioning system:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Access mode - proxy (server in the UI) or direct (browser in the UI).
    url: http://localhost:9090
    jsonData:
      httpMethod: POST
      manageAlerts: true
      prometheusType: Prometheus
      prometheusVersion: 2.44.0
      cacheLevel: 'High'
      disableRecordingRules: false
      incrementalQueryOverlapWindow: 10m
      exemplarTraceIdDestinations:
        # Field with internal link pointing to data source in Grafana.
        # datasourceUid value can be anything, but it should be unique across all defined data source uids.
        - datasourceUid: my_jaeger_uid
          name: traceID
        # Field with external link.
        - name: traceID
          url: 'http://localhost:3000/explore?orgId=1&left=%5B%22now-1h%22,%22now%22,%22Jaeger%22,%7B%22query%22:%22$${__value.raw}%22%7D%5D'
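The provisioning file above only tells Grafana where to find Prometheus; Prometheus itself still needs to know what to scrape. Below is a minimal prometheus.yml sketch; the job name and target address are assumptions to adapt to your environment.

# prometheus.yml -- minimal scrape configuration (illustrative only)
global:
  scrape_interval: 15s                  # how often targets are scraped
scrape_configs:
  - job_name: 'my-app'                  # hypothetical job name
    static_configs:
      - targets: ['localhost:8080']     # assumes the app exposes metrics on /metrics at this port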
Loki is a horizontally scalable, highly available, multi-tenant log aggregation system inspired by Prometheus. Loki is built around the idea of only indexing metadata about your logs: labels (just like Prometheus labels). Log data itself is then compressed and stored in chunks in object stores such as S3 or GCS, or even locally on a filesystem:
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
    basicAuth: true
    basicAuthUser: my_user
    jsonData:
      maxLines: 1000
      derivedFields:
        # Field with internal link pointing to data source in Grafana.
        # datasourceUid value can be anything, but it should be unique across all defined data source uids.
        - datasourceUid: my_jaeger_uid
          matcherRegex: "traceID=(\\w+)"
          name: TraceID
          # url will be interpreted as query for the datasource
          url: '$${__value.raw}'
          # optional for URL Label to set a custom display label for the link.
          urlDisplayLabel: 'View Trace'
        # Field with external link.
        - matcherRegex: "traceID=(\\w+)"
          name: TraceID
          url: 'http://localhost:16686/trace/$${__value.raw}'
    secureJsonData:
      basicAuthPassword: test_password
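The data source above assumes something is already shipping logs into Loki. Below is a minimal Promtail sketch that tails local log files and pushes them to the Loki instance configured above; the file paths and labels are assumptions.

# promtail-config.yaml -- minimal log shipping configuration (illustrative only)
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml                   # where Promtail records how far it has read
clients:
  - url: http://localhost:3100/loki/api/v1/push   # the Loki instance from the data source above
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs                            # label to query on in Grafana
          __path__: /var/log/*log                 # assumed log location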
Jaeger is the last element of our puzzle, providing open-source, end-to-end distributed tracing. For more information on how to set up the Jaeger data source, please refer to the Grafana Jaeger documentation.
apiVersion: 1
datasources:
  - name: Jaeger
    type: jaeger
    uid: EbPG8fYoz
    url: http://localhost:16686
    access: proxy
    basicAuth: true
    basicAuthUser: my_user
    readOnly: false
    isDefault: false
    jsonData:
      tracesToLogsV2:
        # Field with an internal link pointing to a logs data source in Grafana.
        # datasourceUid value must match the uid value of the logs data source.
        datasourceUid: 'loki'
        spanStartTimeShift: '1h'
        spanEndTimeShift: '-1h'
        tags: ['job', 'instance', 'pod', 'namespace']
        filterByTraceID: false
        filterBySpanID: false
        customQuery: true
        query: 'method="${__span.tags.method}"'
      tracesToMetrics:
        datasourceUid: 'prom'
        spanStartTimeShift: '1h'
        spanEndTimeShift: '-1h'
        tags: [{ key: 'service.name', value: 'service' }, { key: 'job' }]
        queries:
          - name: 'Sample query'
            query: 'sum(rate(traces_spanmetrics_latency_bucket{$$__tags}[5m]))'
      nodeGraph:
        enabled: true
      traceQuery:
        timeShiftEnabled: true
        spanStartTimeShift: '1h'
        spanEndTimeShift: '-1h'
      spanBar:
        type: 'None'
    secureJsonData:
      basicAuthPassword: my_password
The cool thing about this stack is that Grafana correlates Jaeger traces with Prometheus metrics and Loki logs by trace ID in real time, which makes it possible to render a live component graph of your services via the node graph enabled above.
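If you want to experiment with the tracing side locally, the Jaeger all-in-one image is the quickest way to get the query endpoint used by the data source above. Below is a minimal docker-compose sketch; the image tag is an assumption, and in production you would run the Jaeger components separately.

# docker-compose.yaml -- local experiment only (illustrative)
version: '3'
services:
  jaeger:
    image: jaegertracing/all-in-one:1.50     # assumed tag; pick a current release
    environment:
      - COLLECTOR_OTLP_ENABLED=true          # accept traces over OTLP
    ports:
      - "16686:16686"                        # Jaeger UI / query API used by the Grafana data source
      - "4317:4317"                          # OTLP gRPC ingest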
While mean time to resolution (MTTR) is just one metric in a DevOps team's toolbox, it is one of the most comprehensive, because it encompasses the entire incident, from first identification all the way through to resolution, when all systems have recovered and returned to normal. Using the right tool set, such as Grafana or Azure Application Insights, not only helps you catch problems that may lead to incidents before they get deployed to production, it can also help you debug and recover from incidents after they happen.
Hope you liked the post. For more SRE content, please subscribe to our newsletter and follow us on Twitter and LinkedIn.
Save your privacy, be ethical!