Prometheus troubleshooting
Target down (up == 0)
Open /targets in the Prometheus UI and read the error column (connection
refused, timeout, 404, SSL). Verify host:port, firewall rules, and that the
exporter listens on 0.0.0.0, not only on localhost. A wrong metrics_path
or a Kubernetes pod that is not ready are frequent causes. Test from the
Prometheus host:
curl http://target:9100/metrics
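The manual check can be wrapped in a small helper. This is a minimal sketch, not part of any Prometheus tooling: the function names metrics_url and probe_target are hypothetical, and the host "target" and port 9100 are the placeholders from the text.

```shell
#!/bin/sh
# Build the scrape URL the way a scrape job would: http://host:port + metrics_path.
# Port and path fall back to the common node_exporter defaults (an assumption).
metrics_url() {
  host=$1
  port=${2:-9100}
  path=${3:-/metrics}
  printf 'http://%s:%s%s\n' "$host" "$port" "$path"
}

# Probe the URL and print the HTTP status code.
# curl reports 000 when it cannot connect (refused, DNS failure, timeout).
probe_target() {
  url=$(metrics_url "$@")
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  echo "$url -> HTTP $code"
}

# Usage, run on the Prometheus host:
#   probe_target target 9100 /metrics
```

A 200 here with a failing scrape in /targets usually points at the scrape config (wrong metrics_path, scheme, or port) rather than the network.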
Grafana shows “No data”
Separate Prometheus health from Grafana configuration. First confirm that the
series shows up in the Prometheus UI at /graph. Then, in Grafana → Explore, run
the same PromQL against the Prometheus data source. If the query is empty in
Prometheus itself, scraping or metric names are wrong, not the dashboard. If
it is empty only in Grafana, check the data source URL
(http://prometheus:9090), the time range, and label filters.
Scrape timeout / context deadline exceeded
The target is too slow or returns a very large metrics payload. Increase
scrape_timeout for that job (it cannot exceed scrape_interval) or fix the
exporter. Check scrape_duration_seconds to see how long each scrape takes.
Cardinality explosions (millions of series) slow scrapes; drop
high-cardinality labels in the application or via
metric_relabel_configs.
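As an illustration, the two settings might look like this in a scrape job. This is a sketch under assumptions: the job name slow-exporter, the target address, and the label request_id are hypothetical, and the script only validates the file when promtool happens to be installed.

```shell
#!/bin/sh
# Write a hypothetical scrape config to a temp file.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: slow-exporter          # hypothetical job name
    scrape_timeout: 20s              # must not exceed scrape_interval
    static_configs:
      - targets: ['target:9100']     # placeholder target
    metric_relabel_configs:
      - action: labeldrop            # applied after the scrape, before ingestion
        regex: request_id            # hypothetical high-cardinality label
EOF

# Validate only when promtool is available on this machine.
if command -v promtool >/dev/null 2>&1; then
  promtool check config "$cfg"
else
  echo "promtool not found; wrote $cfg without validating"
fi
```

Dropping a label is only safe when the remaining labels still identify each series uniquely; otherwise the scrape produces duplicate series.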
PromQL returns empty or unexpected results
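Usual causes: a wrong metric name, wrong labels, or rate() applied to a gauge.
Inspect the available series with {__name__=~".+"} (expensive on a large
server, so narrow the regex where possible) or with the Grafana metric browser.
A counter needs rate(metric[5m]); a gauge is queried directly. Label matchers
are case-sensitive. A recording rule has its own name, which differs from the
raw metric.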
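Metric names and types can also be looked up through the Prometheus HTTP API instead of guessing. The sketch below only builds the request URLs; the helper names are hypothetical, and PROM_URL defaulting to http://localhost:9090 is an assumption about where Prometheus runs.

```shell
#!/bin/sh
# Assumed Prometheus address; override by exporting PROM_URL.
PROM_URL=${PROM_URL:-http://localhost:9090}

# URL that lists every metric name known to the server.
metric_names_url() {
  printf '%s/api/v1/label/__name__/values\n' "$PROM_URL"
}

# URL that returns metadata for one metric, including its type
# (counter, gauge, histogram, summary).
metric_metadata_url() {
  printf '%s/api/v1/metadata?metric=%s\n' "$PROM_URL" "$1"
}

# Usage against a running server (needs curl and jq);
# node_cpu_seconds_total is just an example metric name:
#   curl -s "$(metric_names_url)" | jq -r '.data[]' | grep -i cpu
#   curl -s "$(metric_metadata_url node_cpu_seconds_total)" | jq '.data'
```

If the metadata reports a counter, wrap the metric in rate(); if it reports a gauge, query it directly.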
Config reload fails / Prometheus won’t start
Run promtool check config /etc/prometheus/prometheus.yml and
promtool check rules /etc/prometheus/rules/*.yml. YAML indentation
errors and duplicate rule names are common. Check the journal for the
startup or reload error:
journalctl -u prometheus -e.
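One way to avoid reloading a broken file is to gate the reload on the promtool check. A minimal sketch, assuming Prometheus was started with --web.enable-lifecycle (otherwise the /-/reload endpoint is disabled) and listens on localhost:9090; the function name reload_if_valid is hypothetical.

```shell
#!/bin/sh
# Reload Prometheus only when the config passes validation.
reload_if_valid() {
  cfg=${1:-/etc/prometheus/prometheus.yml}
  if promtool check config "$cfg"; then
    # Requires --web.enable-lifecycle on the Prometheus server.
    curl -s -X POST "${PROM_URL:-http://localhost:9090}/-/reload"
  else
    echo "config check failed for $cfg; not reloading" >&2
    return 1
  fi
}

# Usage:
#   reload_if_valid /etc/prometheus/prometheus.yml
```

Where the lifecycle endpoint is not enabled, sending SIGHUP to the Prometheus process triggers the same reload.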
Disk full / TSDB corruption
Reduce retention (--storage.tsdb.retention.time), drop noisy metrics, or add
disk. Monitor prometheus_tsdb_storage_blocks_bytes. After a crash, Prometheus
may need time to replay the WAL; watch the logs. See the
disk volumes lab.
Alerts not firing or spamming
Rules need a for: duration to avoid flapping. Check the ALERTS and
ALERTS_FOR_STATE series in Prometheus to see which alerts are pending or
firing. Alertmanager routes may drop or group alerts; check the
Alertmanager UI at alertmanager:9093 and amtool alert. Silence alerts
during maintenance with Alertmanager silences.
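A rule with a for: clause might look like the following. This is a sketch, not a recommended alert: the group name, alert name, 5m duration, and severity label are all illustrative assumptions, and validation runs only if promtool is installed.

```shell
#!/bin/sh
# Write a hypothetical alerting rule file to a temp location.
rules=$(mktemp)
cat > "$rules" <<'EOF'
groups:
  - name: example                    # hypothetical group name
    rules:
      - alert: TargetDown            # hypothetical alert name
        expr: up == 0
        for: 5m                      # must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is down"
EOF

# Validate only when promtool is available on this machine.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$rules"
else
  echo "promtool not found; wrote $rules without validating"
fi
```

While the expression is true but the for: window has not elapsed, the alert appears in ALERTS with alertstate="pending", which helps explain an alert that seems not to fire.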
Kubernetes pods not scraped
With classic annotation-based discovery, verify the annotations
prometheus.io/scrape: "true", prometheus.io/port, and prometheus.io/path.
With the Prometheus Operator, verify the ServiceMonitor CRD instead.
RBAC must let Prometheus list pods and services. The Targets page shows
“unknown” labels if relabeling drops needed metadata.
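The RBAC requirement can be checked with kubectl impersonation. A minimal sketch, assuming Prometheus runs as service account prometheus in namespace monitoring (both hypothetical names) and that your own user is allowed to impersonate service accounts.

```shell
#!/bin/sh
# Build the impersonation subject for a service account.
# Defaults are assumptions; pass your own namespace and account name.
sa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "${1:-monitoring}" "${2:-prometheus}"
}

# Ask the API server whether that account may list pods and services
# in all namespaces. kubectl prints "yes" or "no" for each resource.
check_scrape_rbac() {
  subject=$(sa_subject "$@")
  for resource in pods services; do
    printf '%s: ' "$resource"
    kubectl auth can-i list "$resource" --as="$subject" --all-namespaces
  done
}

# Usage:
#   check_scrape_rbac monitoring prometheus
```

A "no" for either resource means service discovery cannot see those objects, so the targets never appear regardless of annotations.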
Debugging workflow
1. Targets and up metric
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq .
# Or open http://localhost:9090/targets
2. Query in Prometheus, then Grafana
# Prometheus /graph: confirm the series exists
# Grafana → Explore: same query, same time range
3. Config and logs
promtool check config /etc/prometheus/prometheus.yml
journalctl -u prometheus -n 50 --no-pager
Practice scenarios
Hands-on Prometheus scenarios on live Linux VMs: prometheus