· Monitoring · 3 min read
Production Observability on Kubernetes with Prometheus and Grafana
You can't fix what you can't see. Here's how I set up a full observability stack on EKS using Prometheus, Grafana, and the kube-prometheus-stack.
Why Observability Matters
Running Kubernetes in production without observability is flying blind. When something breaks at 2am, you need metrics, logs, and traces to diagnose fast. After setting up monitoring stacks across multiple production environments, here's the setup I always reach for.
The Stack
- Prometheus — metrics collection and alerting
- Grafana — visualization and dashboards
- kube-state-metrics — Kubernetes object metrics
- node-exporter — host-level metrics
- Alertmanager — alert routing and silencing
The easiest way to get all of this is the kube-prometheus-stack Helm chart.
Installation
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values values.yaml
```

My base `values.yaml`:
```yaml
grafana:
  enabled: true
  adminPassword: "change-me-use-secrets-manager"
  ingress:
    enabled: true
    hosts:
      - grafana.internal.yourdomain.com

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 10Gi
```

Alerting Rules
Prometheus alert rules are defined as Kubernetes resources:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
spec:
  groups:
    - name: availability
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: "Container {{ $labels.container }} in pod {{ $labels.pod }} has restarted in the last 5 minutes."
        - alert: HighMemoryUsage
          expr: |
            (container_memory_working_set_bytes{container!=""}
              / (container_spec_memory_limit_bytes{container!=""} > 0)) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} memory above 85%"
```

Note the `> 0` guard on the memory limit: without it, containers with no limit set divide by zero, yielding `+Inf` and firing the alert spuriously.

Routing Alerts to Slack

Alertmanager reads its configuration from a Secret, which you can override in place:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-stack-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: slack-notifications
      routes:
        - match:
            severity: critical
          receiver: slack-critical
    receivers:
      - name: slack-notifications
        slack_configs:
          - channel: '#alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: slack-critical
        slack_configs:
          - channel: '#alerts-critical'
            title: '🔴 CRITICAL: {{ .GroupLabels.alertname }}'
```

Key Dashboards to Import
The kube-prometheus-stack ships with solid default dashboards, but these Grafana IDs are worth adding:
| Dashboard | Grafana ID |
|---|---|
| Kubernetes Cluster Overview | 7249 |
| Node Exporter Full | 1860 |
| Kubernetes Deployments | 8588 |
| JVM Micrometer (Spring Boot) | 4701 |
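Rather than importing these by hand, the chart's bundled Grafana can pull them by ID at deploy time. A minimal sketch of the `values.yaml` addition, shown here for Node Exporter Full (the `revision` number is illustrative; check grafana.com for the current one):

```yaml
grafana:
  # File-based provider that serves dashboards dropped into the path below
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          folder: ""
          type: file
          disableDeletion: false
          options:
            path: /var/lib/grafana/dashboards/default
  # Dashboards fetched from grafana.com by ID at startup
  dashboards:
    default:
      node-exporter-full:
        gnetId: 1860    # dashboard ID from the table above
        revision: 31    # pin a revision; illustrative, check grafana.com for the latest
        datasource: Prometheus
```

Pinning a revision keeps dashboards reproducible across cluster rebuilds instead of silently tracking upstream changes.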
Application Metrics — Spring Boot Example
For Java/Spring Boot services, expose metrics via Actuator + Micrometer:
```xml
<!-- pom.xml -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```

```yaml
# application.yml (Spring Boot 2.x property paths; in Boot 3.x the
# export flag moved to management.prometheus.metrics.export.enabled)
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus
  metrics:
    export:
      prometheus:
        enabled: true
```

Then configure Prometheus to scrape it via a ServiceMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-spring-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # by default the chart's Prometheus only picks up ServiceMonitors with this label
spec:
  selector:
    matchLabels:
      app: my-spring-app
  namespaceSelector:
    matchNames:
      - default   # the namespace where the target Service lives; adjust to yours
  endpoints:
    - port: http
      path: /actuator/prometheus
      interval: 15s
```

Conclusion
A solid observability stack transforms incident response from guesswork to diagnosis. The kube-prometheus-stack gets you 80% of the way there out of the box — the remaining 20% is custom alert rules and application-level metrics tuned to your specific workloads.