Prometheus troubleshooting

Target down (up == 0)

Open /targets in the Prometheus UI and read the error column (connection refused, timeout, 404, SSL). Verify host:port and firewall rules, and confirm the exporter listens on 0.0.0.0 rather than only on localhost. A wrong metrics_path or a Kubernetes pod that is not ready are frequent causes. Test from the Prometheus host: curl http://target:9100/metrics
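
The error categories above map fairly directly onto curl exit codes, so a small probe can narrow the cause. This is a minimal sketch: the function names explain_curl_exit and probe_target are hypothetical helpers, and the 10-second limit is an arbitrary assumption.

```shell
# Map a curl exit code to the likely cause of a down target.
explain_curl_exit() {
  case "$1" in
    0)     echo "ok: endpoint reachable" ;;
    6)     echo "DNS: could not resolve host" ;;
    7)     echo "connect failed: check listener address, port and firewall" ;;
    22)    echo "HTTP error (4xx/5xx): check metrics_path" ;;
    28)    echo "timeout: target too slow or packets dropped" ;;
    35|60) echo "TLS problem: check scheme and certificates" ;;
    *)     echo "other curl error: $1 (see EXIT CODES in man curl)" ;;
  esac
}

# Probe a scrape endpoint from the Prometheus host.
# --fail turns HTTP errors into exit code 22; --max-time bounds the wait.
probe_target() {
  curl --silent --fail --max-time 10 --output /dev/null "$1"
  explain_curl_exit "$?"
}

# Usage: probe_target http://target:9100/metrics
```

Run it from the Prometheus host itself, since a target that answers from a laptop can still be blocked from the server.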

Grafana shows “No data”

Separate Prometheus health from Grafana config. In Grafana → Explore, run the same PromQL against the Prometheus data source. If the result is empty there too, the problem is scraping or metric names, not the dashboard. Check the data source URL (e.g. http://prometheus:9090), the time range, and label filters. Confirm first that the series shows up in the Prometheus UI at /graph.

Scrape timeout / context deadline exceeded

The target is too slow or returns a huge metric payload. Increase scrape_timeout for that job (it cannot exceed scrape_interval) or fix the exporter. Check scrape_duration_seconds to see how long each scrape takes. Cardinality explosions (millions of series) slow scrapes; drop high-cardinality labels in the app or via metric_relabel_configs.
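
To see which metric names dominate a payload, the exposition text can be tallied per metric name. A minimal sketch, assuming the target serves the plain-text exposition format; count_series_per_metric is a hypothetical helper name.

```shell
# Count series per metric name in Prometheus text exposition read from stdin.
# Comment lines (# HELP, # TYPE) and blank lines are skipped; the metric name
# is everything before the first "{" or space.
count_series_per_metric() {
  awk '!/^#/ && NF { name = $0; sub(/[{ ].*/, "", name); n[name]++ }
       END { for (m in n) print n[m], m }' | sort -k1,1nr -k2,2
}

# Usage: curl -s http://target:9100/metrics | count_series_per_metric | head
```

The names at the top of the list are the first candidates for dropping labels or whole metrics.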

PromQL returns empty or unexpected results

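Usual causes: a wrong metric name, wrong labels, or rate() applied to a gauge. Inspect available series with {__name__=~".+"} (expensive on a large server, so narrow the regex where you can) or the Grafana metric browser. A counter needs rate(metric[5m]); a gauge is queried directly. Label matchers are case-sensitive. A recording rule has its own name, which differs from the raw metric it is built from.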
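
These checks can also be run from a shell through the Prometheus HTTP API. A minimal sketch: promql is a hypothetical helper, PROM_URL is an assumed variable, and the node_* metrics in the usage comments are node_exporter names used only as examples.

```shell
# Run an instant PromQL query against the Prometheus HTTP API.
# PROM_URL is an assumption; point it at your own server.
promql() {
  curl --silent --get "${PROM_URL:-http://localhost:9090}/api/v1/query" \
       --data-urlencode "query=$1"
}

# Usage sketches:
#   promql 'count by (__name__) ({__name__=~"node_.+"})'  # which node_* metrics exist?
#   promql 'rate(node_cpu_seconds_total[5m])'             # counter: wrap in rate()
#   promql 'node_memory_MemAvailable_bytes'               # gauge: query directly
```

Using --data-urlencode avoids shell and URL quoting problems with braces, quotes and spaces in the query.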

Config reload fails / Prometheus won’t start

Run promtool check config /etc/prometheus/prometheus.yml and promtool check rules /etc/prometheus/rules/*.yml. YAML indentation errors and duplicate rule names are common. Check journal: journalctl -u prometheus -e.
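
Validation and reload can be chained so that a broken file never reaches the running server. A minimal sketch: reload_prometheus is a hypothetical helper, PROM_URL is an assumed variable, and the /-/reload endpoint only works when Prometheus was started with --web.enable-lifecycle (otherwise send SIGHUP to the process).

```shell
# Validate the config first; only ask Prometheus to reload if it passes.
reload_prometheus() {
  cfg="${1:-/etc/prometheus/prometheus.yml}"
  if ! promtool check config "$cfg"; then
    echo "config invalid, not reloading" >&2
    return 1
  fi
  # Requires --web.enable-lifecycle on the Prometheus side.
  curl --silent --show-error --fail -X POST \
       "${PROM_URL:-http://localhost:9090}/-/reload"
}

# Usage: reload_prometheus /etc/prometheus/prometheus.yml
```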

Disk full / TSDB corruption

Reduce retention (--storage.tsdb.retention.time, or cap usage with --storage.tsdb.retention.size), drop noisy metrics, or add disk. Monitor prometheus_tsdb_storage_blocks_bytes. After a crash, Prometheus may need time to replay the WAL; watch the logs. See disk volumes lab.

Alerts not firing or spamming

Give rules a for: duration so a brief spike does not fire and resolve repeatedly. Check the ALERTS and ALERTS_FOR_STATE metrics in Prometheus to see whether an alert is pending or firing. Alertmanager routes may drop or group alerts; check the UI at alertmanager:9093 and amtool alert. During maintenance, mute alerts with Alertmanager silences.
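
The Alertmanager side can be inspected and silenced from the command line with amtool. A minimal sketch: list_alerts and silence_alert are hypothetical helpers, AM_URL is an assumed variable, and HostDown is a made-up alert name.

```shell
# Alertmanager address; an assumption, adjust to your setup.
AM_URL="${AM_URL:-http://alertmanager:9093}"

# List the alerts Alertmanager currently holds. If an alert is firing in
# Prometheus but missing here, look at delivery rather than at the rule.
list_alerts() {
  amtool alert query --alertmanager.url="$AM_URL"
}

# Silence one alert by name for a maintenance window (default 1h).
silence_alert() {
  amtool silence add "alertname=$1" \
    --duration="${2:-1h}" \
    --comment="planned maintenance" \
    --alertmanager.url="$AM_URL"
}

# Usage: silence_alert HostDown 2h
```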

Kubernetes pods not scraped

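Verify the annotations prometheus.io/scrape: "true", prometheus.io/port, and prometheus.io/path if you use the classic annotation pattern, or the ServiceMonitor CRD if you use the Prometheus Operator. RBAC must let Prometheus list pods/services. If relabeling drops needed metadata, targets can show up with missing or “unknown” labels, or not at all; compare against the Service Discovery page (/service-discovery) in the Prometheus UI.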
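
Both the RBAC and the annotation checks can be done with kubectl. A minimal sketch: check_scrape_rbac and show_pod_annotations are hypothetical helpers, and the namespace "monitoring" and service account "prometheus" are assumptions about where Prometheus runs.

```shell
# Ask the API server whether the Prometheus service account may list
# pods and services cluster-wide. Prints "yes" or "no" per resource.
check_scrape_rbac() {
  sa="system:serviceaccount:${1:-monitoring}:${2:-prometheus}"
  for res in pods services; do
    printf '%s: ' "$res"
    kubectl auth can-i list "$res" --as="$sa" --all-namespaces
  done
}

# Print a pod's annotations to confirm prometheus.io/scrape, port and path.
show_pod_annotations() {
  kubectl get pod "$1" -n "${2:-default}" -o jsonpath='{.metadata.annotations}'
}

# Usage:
#   check_scrape_rbac monitoring prometheus
#   show_pod_annotations my-app-pod my-namespace
```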

Debugging workflow

1. Targets and up metric

curl -s 'localhost:9090/api/v1/query?query=up' | jq .
# Or open http://localhost:9090/targets

2. Query in Prometheus, then Grafana

# Prometheus /graph — confirm series exists
# Grafana → Explore — same query, same time range

3. Config and logs

promtool check config /etc/prometheus/prometheus.yml
journalctl -u prometheus -n 50 --no-pager

