· Monitoring · 3 min read
Production Observability on Kubernetes with Prometheus and Grafana
You can't fix what you can't see. Here's how I set up a full observability stack on EKS using Prometheus, Grafana, and the kube-prometheus-stack.
Why Observability Matters
Running Kubernetes in production without observability is flying blind. When something breaks at 2am, you need metrics, logs, and traces to diagnose fast. After setting up monitoring stacks across multiple production environments, here's the setup I always reach for.
The Stack
- Prometheus — metrics collection and alerting
- Grafana — visualization and dashboards
- kube-state-metrics — Kubernetes object metrics
- node-exporter — host-level metrics
- Alertmanager — alert routing and silencing
The easiest way to get all of this is the kube-prometheus-stack Helm chart.
Installation
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values values.yaml
```

My base `values.yaml`:
```yaml
grafana:
  enabled: true
  adminPassword: "change-me-use-secrets-manager"
  ingress:
    enabled: true
    hosts:
      - grafana.internal.yourdomain.com

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 10Gi
```

Alerting Rules
Prometheus alert rules are defined as Kubernetes resources:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
spec:
  groups:
    - name: availability
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: "Container {{ $labels.container }} in pod {{ $labels.pod }} has restarted in the last 5 minutes."
        - alert: HighMemoryUsage
          expr: |
            (container_memory_working_set_bytes{container!=""}
              / (container_spec_memory_limit_bytes{container!=""} > 0)) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} memory above 85%"
```

Note the `> 0` guard on the memory limit: without it, containers with no limit set divide by zero, yielding `+Inf` and firing the alert spuriously.

Routing Alerts to Slack

Alertmanager reads its configuration from a Secret, which you can override in place:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-stack-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: slack-notifications
      routes:
        - match:
            severity: critical
          receiver: slack-critical
    receivers:
      - name: slack-notifications
        slack_configs:
          - channel: '#alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: slack-critical
        slack_configs:
          - channel: '#alerts-critical'
            title: '🔴 CRITICAL: {{ .GroupLabels.alertname }}'
```

Key Dashboards to Import
The kube-prometheus-stack ships with solid default dashboards, but these Grafana IDs are worth adding:
| Dashboard | Grafana ID |
|---|---|
| Kubernetes Cluster Overview | 7249 |
| Node Exporter Full | 1860 |
| Kubernetes Deployments | 8588 |
| JVM Micrometer (Spring Boot) | 4701 |
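Rather than importing these by hand, the chart's bundled Grafana can pull them by ID at deploy time. A minimal sketch of the `values.yaml` addition, shown here for Node Exporter Full (the `revision` number is illustrative; check grafana.com for the current one):

```yaml
grafana:
  # File-based provider that serves dashboards dropped into the path below
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          folder: ""
          type: file
          disableDeletion: false
          options:
            path: /var/lib/grafana/dashboards/default
  # Dashboards fetched from grafana.com by ID at startup
  dashboards:
    default:
      node-exporter-full:
        gnetId: 1860    # dashboard ID from the table above
        revision: 31    # pin a revision; illustrative, check grafana.com for the latest
        datasource: Prometheus
```

Pinning a revision keeps dashboards reproducible across cluster rebuilds instead of silently tracking upstream changes.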
Application Metrics — Spring Boot Example
For Java/Spring Boot services, expose metrics via Actuator + Micrometer:
```xml
<!-- pom.xml -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```

```yaml
# application.yml (Spring Boot 2.x property paths; in Boot 3.x the
# export flag moved to management.prometheus.metrics.export.enabled)
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus
  metrics:
    export:
      prometheus:
        enabled: true
```

Then configure Prometheus to scrape it via a ServiceMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-spring-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # by default the chart's Prometheus only picks up ServiceMonitors with this label
spec:
  selector:
    matchLabels:
      app: my-spring-app
  namespaceSelector:
    matchNames:
      - default   # the namespace where the target Service lives; adjust to yours
  endpoints:
    - port: http
      path: /actuator/prometheus
      interval: 15s
```

Conclusion
A solid observability stack transforms incident response from guesswork to diagnosis. The kube-prometheus-stack gets you 80% of the way there out of the box — the remaining 20% is custom alert rules and application-level metrics tuned to your specific workloads.