Monitoring · 3 min read

Production Observability on Kubernetes with Prometheus and Grafana

You can't fix what you can't see. Here's how I set up a full observability stack on EKS using Prometheus, Grafana, and the kube-prometheus-stack.

Why Observability Matters

Running Kubernetes in production without observability is flying blind. When something breaks at 2am, you need metrics, logs, and traces to diagnose fast. After setting up monitoring stacks across multiple production environments, here’s the setup I always reach for.

The Stack

  • Prometheus — metrics collection and alerting
  • Grafana — visualization and dashboards
  • kube-state-metrics — Kubernetes object metrics
  • node-exporter — host-level metrics
  • Alertmanager — alert routing and silencing

The easiest way to get all of this is the kube-prometheus-stack Helm chart.

Installation

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values values.yaml

My base values.yaml:

grafana:
  enabled: true
  adminPassword: "change-me-use-secrets-manager"
  ingress:
    enabled: true
    hosts:
      - grafana.internal.yourdomain.com

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 10Gi
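Once the chart is installed, a quick sanity check (release name and namespace as used above; the Grafana service name follows the chart's `<release>-grafana` convention):

```
# All pods in the stack should reach Running/Ready
kubectl get pods -n monitoring

# For a quick look at Grafana without waiting on DNS for the ingress,
# port-forward the service locally
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
```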

Alerting Rules

Prometheus alert rules are defined as Kubernetes resources:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
spec:
  groups:
    - name: availability
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: "Container {{ $labels.container }} in pod {{ $labels.pod }} has been restarting over the last 5 minutes."

        - alert: HighMemoryUsage
          expr: |
            # container!="" excludes pause/aggregate series; the > 0 guard skips
            # containers with no memory limit set (which would divide by zero)
            container_memory_working_set_bytes{container!=""}
              / (container_spec_memory_limit_bytes{container!=""} > 0) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} memory above 85%"
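Threshold alerts go quiet when the metric disappears entirely, so it's worth pairing them with a target-down rule. A sketch in the same PrometheusRule format (the job label here is an assumption — match it to your scrape config):

```
        - alert: TargetDown
          expr: up{job="my-spring-app"} == 0  # job name assumed
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Scrape target {{ $labels.instance }} is down"
```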

Routing Alerts to Slack

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-stack-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: slack-notifications
      routes:
        - match:
            severity: critical
          receiver: slack-critical

    receivers:
      - name: slack-notifications
        slack_configs:
          - channel: '#alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

      - name: slack-critical
        slack_configs:
          - channel: '#alerts-critical'
            title: '🔴 CRITICAL: {{ .GroupLabels.alertname }}'
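Alertmanager keeps running on its previous configuration when a reload fails, so lint the file before applying the Secret. `amtool`, which ships alongside Alertmanager, can check it:

```
amtool check-config alertmanager.yaml
```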

Key Dashboards to Import

The kube-prometheus-stack ships with solid default dashboards, but these Grafana IDs are worth adding:

Dashboard                    | Grafana ID
Kubernetes Cluster Overview  | 7249
Node Exporter Full           | 1860
Kubernetes Deployments       | 8588
JVM Micrometer (Spring Boot) | 4701

Application Metrics β€” Spring Boot Example

For Java/Spring Boot services, expose metrics via Actuator + Micrometer:

<!-- pom.xml -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,prometheus
  metrics:
    export:
      prometheus:
        enabled: true  # Spring Boot 2.x key; Boot 3.x renamed it to management.prometheus.metrics.export.enabled

Then configure Prometheus to scrape it via a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-spring-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # the chart's Prometheus only discovers ServiceMonitors carrying its release label
spec:
  selector:
    matchLabels:
      app: my-spring-app
  namespaceSelector:
    any: true  # the app's Service usually lives outside the monitoring namespace
  endpoints:
    - port: http
      path: /actuator/prometheus
      interval: 15s
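The ServiceMonitor matches a Service by label and by port name, so the Service needs both. A minimal sketch (name and ports assumed):

```
apiVersion: v1
kind: Service
metadata:
  name: my-spring-app
  labels:
    app: my-spring-app     # matched by spec.selector.matchLabels above
spec:
  selector:
    app: my-spring-app
  ports:
    - name: http           # matched by the ServiceMonitor's endpoint port
      port: 8080
      targetPort: 8080
```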

Conclusion

A solid observability stack transforms incident response from guesswork to diagnosis. The kube-prometheus-stack gets you 80% of the way there out of the box β€” the remaining 20% is custom alert rules and application-level metrics tuned to your specific workloads.

