Distributed alert management (DAM) lets you automatically identify violations of service level objectives and any risky activity inside a cluster and its GitOps infrastructure. In my previous post, I presented a redundant monitoring infrastructure built on a variety of tools such as Grafana, Prometheus, Loki and Thanos. This article focuses on integrating continuous monitoring and alerting with a modern Kubernetes cluster.
To collect real-time information from a Kubernetes cluster we use Prometheus, a tool that scrapes and stores metrics as time series data. Alertmanager then handles the alerts sent by client applications such as the Prometheus server: it takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, SMS, an MS Teams channel and so on. It also takes care of silencing and inhibiting alerts.
Prometheus operator
All alerts in Alertmanager are configured in YAML format, and Alertmanager itself is configured for high availability. The essential part of the solution is the Prometheus Operator, which introduces an Alertmanager resource that allows users to declaratively describe an Alertmanager cluster. To successfully deploy an Alertmanager cluster, it is important to understand the contract between Prometheus and Alertmanager. Alertmanager is used to:
- Deduplicate alerts received from Prometheus.
- Silence alerts.
- Route and send grouped notifications to various integrations (PagerDuty, OpsGenie, mail, chat, …).
First, we need to deploy an Alertmanager cluster:
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: example
spec:
  replicas: 3
Watch the pods until all replicas are up:
kubectl get pods -l alertmanager=example -w
Next, create an AlertmanagerConfig resource that sends notifications to the appropriate AWS Lambda webhook:
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example-alertmanagerconfig
spec:
  route:
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 12h
    receiver: 'lambda-webhook'
    matchers:
    - name: namespace
      matchType: '='
      value: "my-cluster-namespace"
    - name: severity
      matchType: '=~'
      value: "warning|critical|error"
  receivers:
  - name: 'lambda-webhook'
    webhookConfigs:
    - urlSecret:
        name: alertmanager   # Secret that stores the webhook URL
        key: webhook-url     # illustrative key; its value is the URL, e.g. http://my-lambda-webhook/
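The urlSecret reference above expects a Secret in the same namespace that actually holds the Lambda endpoint. A minimal sketch, assuming the illustrative key webhook-url used above:
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager                          # same namespace as the AlertmanagerConfig
type: Opaque
stringData:
  webhook-url: 'http://my-lambda-webhook/'    # the URL Alertmanager will POST alerts to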
The PrometheusRule CRD allows you to define alerting and recording rules. The operator knows which PrometheusRule objects to select for a given Prometheus based on the spec.ruleSelector field. By default, the Prometheus resource discovers only PrometheusRule resources in the same namespace. This can be refined with the ruleNamespaceSelector field (see the sketch after this list):
- To discover rules from all namespaces, pass an empty dict (ruleNamespaceSelector: {}).
- To discover rules from all namespaces matching a certain label, use the matchLabels field.
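For example, a minimal Prometheus resource (only the rule-selection fields are shown here) that picks up rules carrying the labels from the manifest below, from every namespace, could look like this:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
spec:
  # select PrometheusRule objects carrying the labels used in the manifest below
  ruleSelector:
    matchLabels:
      prometheus: example
      role: alert-rules
  # discover rules in all namespaces; swap {} for matchLabels to narrow the scope
  ruleNamespaceSelector: {}
The alerting rule itself is then defined in a PrometheusRule manifest: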
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  labels:
    prometheus: example
    role: alert-rules
  name: prometheus-alerts
spec:
  groups:
  - name: ./example.rules
    rules:
    - alert: ExampleAlert
      expr: vector(1)
You can then display the rules with:
kubectl get prometheusrule -n prometheus -o yaml
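The ExampleAlert above fires unconditionally, which is useful for testing the pipeline end to end. A real rule would typically add a for duration, a severity label matching the routing matchers configured earlier, and annotations. A minimal sketch (rule name, metric and duration are illustrative):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: example
    role: alert-rules
  name: target-down-alerts
spec:
  groups:
  - name: availability.rules
    rules:
    - alert: TargetDown
      expr: up == 0                # 'up' is the per-target scrape health metric
      for: 5m                      # fire only after the target has been down for 5 minutes
      labels:
        severity: critical         # matched by the severity matcher in the AlertmanagerConfig
      annotations:
        summary: 'Target {{ $labels.instance }} in job {{ $labels.job }} is down'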
Flux notification controller
The Notification Controller is a Kubernetes operator, specialized in handling inbound and outbound events. For more information go to fluxcd.io. The controller handles events coming from external systems (GitHub, GitLab, Bitbucket, Harbor, Jenkins, etc) and notifies the GitOps toolkit controllers about source changes.
We send events to Alertmanager using the dedicated alertmanager provider type. Alertmanager then distributes the notifications according to the configuration described earlier (in our case, to an MS Teams channel). We catch errors related to all Flux objects. Below is a sample configuration for a test environment:
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Provider
metadata:
  name: alertmanager
  namespace: prometheus-alerts
spec:
  type: alertmanager
  address: http://alertmanager-operated.prometheus:9093/api/v2/alerts/
---
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Alert
metadata:
  name: errors
  namespace: prometheus-alerts
spec:
  providerRef:
    name: alertmanager
  eventSeverity: error
  eventSources:
  - kind: HelmRelease
    name: '*'
    namespace: prometheus-alerts
  - kind: ImagePolicy
    name: '*'
    namespace: prometheus-alerts
  - kind: Kustomization
    name: '*'
    namespace: prometheus-alerts
  - kind: ImageRepository
    name: '*'
    namespace: prometheus-alerts
  - kind: GitRepository
    name: '*'
    namespace: prometheus-alerts
  - kind: HelmChart
    name: '*'
    namespace: prometheus-alerts
  - kind: HelmRepository
    name: '*'
    namespace: prometheus-alerts
  - kind: OCIRepository
    name: '*'
    namespace: prometheus-alerts
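To check that Flux events actually reach Alertmanager, you can port-forward the alertmanager-operated service referenced in the Provider address (the prometheus namespace matches that address) and inspect the alerts in the UI:
kubectl -n prometheus port-forward svc/alertmanager-operated 9093:9093
Then open http://localhost:9093 in a browser; Flux errors should show up grouped by alertname, as configured in the route above.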
In addition to the Flux Notification Controller, we use dashboards in Grafana to visualize cluster behavior over time. As a bonus, Grafana lets us explore logs from Loki and build graphs on top of Prometheus metrics.
Hope you like the post. Please follow me on Twitter or LinkedIn and subscribe to the newsletter below.
Be ethical, save your privacy!