Prometheus guide

What Prometheus does in production

Prometheus is a monitoring system and time-series database. It answers questions like “what is the error rate right now?”, “which instances are out of disk?”, and “did latency spike after the deploy?” Applications expose metrics in a text format; Prometheus pulls them periodically and stores samples with labels. Alerts fire from rules; humans explore data in the UI or — most often — in Grafana.

Prometheus and Grafana

Prometheus collects, stores, and queries metrics (PromQL). Grafana connects to Prometheus as a data source and builds dashboards, graphs, and tables for visualization. Typical stack: exporters and apps → Prometheus (scrape + store), which feeds both Grafana (dashboards) and Alertmanager (notifications). Alerts reach Alertmanager from Prometheus, not from Grafana. When graphs are missing in Grafana, verify the Prometheus query and check that scrapes are healthy, not only the panel config.
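When a panel is empty, it can help to run the same query directly against Prometheus and see whether any series come back. The sketch below uses the Prometheus HTTP API instant-query endpoint (/api/v1/query) with only the Python standard library. The address http://localhost:9090 and the helper names are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: adjust to your server


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{params}"


def series_count(response_body):
    """Count the series in an instant-query JSON response (vector result)."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return len(payload["data"]["result"])


def run_query(promql, base_url=PROMETHEUS_URL):
    """Send the query to Prometheus and return how many series it matched."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return series_count(resp.read())


if __name__ == "__main__":
    # "up" is 1 for each target whose last scrape succeeded, 0 otherwise.
    print(run_query("up"))
```

If the query returns zero series here as well, the problem is upstream of Grafana (scrape failures, wrong metric name, or wrong label matchers).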

Pull model and exporters

Prometheus scrapes /metrics (or custom paths) on an interval (scrape_interval). Targets can be:

Instrumented apps — Prometheus client libraries expose metrics
Exporters — separate processes that translate a system's existing stats to Prometheus format (e.g. node_exporter for Linux host metrics)
Pushgateway — for short-lived batch jobs that cannot be scraped (push, then Prometheus pulls from gateway)
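To make the pull model concrete, here is a minimal sketch of a scrape target built with only the Python standard library. Real applications would normally use a Prometheus client library instead. The metric name demo_requests_total and port 8000 are made up for this example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_SERVED = 0  # in-memory counter; resets when the process restarts


def render_metrics(count):
    """Render one counter in the Prometheus text exposition format."""
    return (
        "# HELP demo_requests_total Requests served by this demo process.\n"
        "# TYPE demo_requests_total counter\n"
        f"demo_requests_total {count}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUESTS_SERVED
        if self.path != "/metrics":
            self.send_error(404)
            return
        REQUESTS_SERVED += 1  # each scrape of /metrics counts as one request
        body = render_metrics(REQUESTS_SERVED).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Prometheus would scrape http://<host>:8000/metrics on each interval.
    HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```

The process never contacts Prometheus. It only answers GET requests, which is why a target that is unreachable shows up as down on the Prometheus side.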

The Prometheus server itself listens on port 9090 by default and exposes its own metrics at /metrics. Check target health at Status → Targets in the Prometheus UI.

Metrics format and labels

Metrics are named counters, gauges, histograms, or summaries, with labels for dimensions: http_requests_total{method="GET",status="200"}. Labels enable aggregation in PromQL, but every unique combination of label values is stored as its own time series, so high-cardinality labels (e.g. user ID) inflate memory and storage. Prefer bounded label sets.
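A rough way to reason about cardinality is to multiply the number of distinct values each label can take. The sketch below computes that upper bound for one metric. The label values and the user_id label are hypothetical.

```python
from math import prod


def series_upper_bound(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


bounded = {
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["200", "400", "404", "500"],
}
# Hypothetical: the same metric with a user_id label of 100,000 distinct values.
unbounded = dict(bounded, user_id=range(100_000))

print(series_upper_bound(bounded))    # 16
print(series_upper_bound(unbounded))  # 1600000
```

One unbounded label turns 16 possible series into 1.6 million for a single metric name.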

Configuration overview

Main file: prometheus.yml (often /etc/prometheus/prometheus.yml).

scrape_configs — jobs, targets, intervals, relabeling
rule_files — alerting and recording rules
alerting — Alertmanager endpoints
remote_write / remote_read — long-term storage integrations (optional)
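As an illustration of how these sections fit together, the sketch below holds a small prometheus.yml as a Python string and lists its top-level sections. The intervals, job names, and target addresses (localhost:9093 for Alertmanager, localhost:9100 for node_exporter) are assumptions, not values taken from this guide.

```python
MINIMAL_CONFIG = """\
global:
  scrape_interval: 15s      # how often to scrape targets
  evaluation_interval: 15s  # how often to evaluate rules

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: node
    static_configs:
      - targets: ["localhost:9100"]
"""


def top_level_sections(config_text):
    """List top-level keys by finding unindented 'key:' lines (no YAML parser)."""
    return [
        line.split(":", 1)[0]
        for line in config_text.splitlines()
        if line and not line[0].isspace() and not line.startswith("#") and ":" in line
    ]


print(top_level_sections(MINIMAL_CONFIG))
# ['global', 'rule_files', 'alerting', 'scrape_configs']
```

Before loading a real file, validating it with promtool check config catches syntax and schema errors that a simple key listing like this would miss.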

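Reload config without restart: curl -X POST localhost:9090/-/reload (only if Prometheus was started with --web.enable-lifecycle), or send SIGHUP to the process, for example with systemctl reload prometheus when the unit file defines a reload action. If the new config is invalid, Prometheus keeps running with the old one.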
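The HTTP reload can also be scripted. This is a minimal sketch using the Python standard library, assuming Prometheus runs at http://localhost:9090 with --web.enable-lifecycle set. The function names are illustrative.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request for the lifecycle reload endpoint."""
    return urllib.request.Request(f"{base_url}/-/reload", data=b"", method="POST")


def reload_config(base_url="http://localhost:9090"):
    """Trigger a reload and return the HTTP status code.

    urlopen raises urllib.error.HTTPError on a non-2xx response, for
    example when the lifecycle API is disabled or the config is rejected.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    print(reload_config())
```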

PromQL essentials

PromQL queries time series. Examples: rate(http_requests_total[5m]) returns the per-second average rate of increase of a counter over the last 5 minutes; node_memory_MemAvailable_bytes reads a gauge directly. Use Grafana’s Explore view to prototype PromQL before adding it to dashboards. Recording rules precompute expensive queries and store the result as a new series.
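To show what a per-second rate over a window means, here is a simplified Python sketch that computes it from raw (timestamp, value) counter samples. It is not how Prometheus implements rate(): the real function also extrapolates to the window boundaries. The sample values are invented.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset (restart from zero).
    Unlike PromQL's rate(), this does not extrapolate to the window edges.
    Timestamps are assumed to be strictly increasing.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarted, so the whole new value counts.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Counter grows by 60 every 60 seconds: one request per second.
print(simple_rate([(0, 100), (60, 160), (120, 220)]))  # 1.0
```

The reset handling is the reason to apply rate() to the raw counter rather than subtracting values by hand: a process restart would otherwise show up as a large negative jump.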

Alerting

Alerting rules in Prometheus evaluate PromQL expressions; firing alerts go to Alertmanager for grouping, silencing, and routing (PagerDuty, Slack, email). Keep alerts actionable — “disk full in 4h” beats “cpu > 50%” with no context.
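An alert like "disk full in 4h" is typically built on a linear forecast. In PromQL that is commonly written with predict_linear, for example predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0. The Python sketch below shows the underlying idea with a least-squares line. It is a simplification: it forecasts from the last sample rather than from the rule evaluation time, and the sample data is invented.

```python
def linear_forecast(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    evaluate it seconds_ahead after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)


def disk_full_within(samples, seconds_ahead):
    """True if the forecast free space drops below zero within the horizon."""
    return linear_forecast(samples, seconds_ahead) < 0


# Free space in GiB, sampled hourly, shrinking by 1 GiB per hour.
plenty_left = [(0, 10.0), (3600, 9.0), (7200, 8.0)]
nearly_full = [(0, 4.0), (3600, 3.0), (7200, 2.0)]

print(disk_full_within(plenty_left, 4 * 3600))  # False
print(disk_full_within(nearly_full, 4 * 3600))  # True
```

Both disks shrink at the same rate, but only the second one is actionable within four hours, which is the context a static threshold cannot provide.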

Service discovery

Static target lists work for small setups. Production often uses Kubernetes SD, file_sd, or cloud SD so new pods are scraped automatically. Relabeling rewrites target labels or drops targets before the scrape (relabel_configs), and rewrites or drops samples before ingestion (metric_relabel_configs). See the Kubernetes lab when pod targets stay down.
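With file_sd, Prometheus watches JSON or YAML files that list target groups, so any script or tool can manage targets by rewriting a file. The sketch below writes one target group in the JSON form (a list of objects with "targets" and "labels"). The addresses, the label, and the function name are hypothetical, and the scrape job is assumed to reference the file through file_sd_configs.

```python
import json
import os
import tempfile


def write_file_sd(path, targets, labels):
    """Write one file_sd target group, replacing the file atomically so
    Prometheus never reads a half-written file."""
    groups = [{"targets": sorted(targets), "labels": labels}]
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False, suffix=".tmp"
    ) as tmp:
        json.dump(groups, tmp, indent=2)
        tmp.write("\n")
    os.replace(tmp.name, path)  # atomic rename within the same directory
    return groups


if __name__ == "__main__":
    # Hypothetical node_exporter targets written to the current directory.
    write_file_sd("node.json", ["10.0.0.2:9100", "10.0.0.1:9100"], {"env": "prod"})
```

Writing to a temporary file and renaming it is a design choice for safety: Prometheus re-reads the file when it changes, so a partial write could otherwise be picked up mid-update.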

Storage and retention

Prometheus stores blocks on local disk (TSDB). Retention defaults to 15 days and is controlled by --storage.tsdb.retention.time. High-cardinality metrics or long retention fill the disk fast, so monitor Prometheus’s own metrics and disk usage.
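For capacity planning, the Prometheus documentation gives a rough formula: disk space is approximately retention time in seconds × ingested samples per second × bytes per sample, with samples averaging about 1 to 2 bytes after compression. The sketch below applies it. The workload numbers are hypothetical, and the result is an estimate, not a guarantee.

```python
def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough TSDB size: retention seconds x ingest rate x bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical load: 100,000 active series scraped every 15 seconds.
samples_per_second = 100_000 / 15
estimate = estimate_disk_bytes(15, samples_per_second)
print(f"{estimate / 1e9:.1f} GB")  # 17.3 GB
```

On a running server, the actual ingest rate can be read from Prometheus's own metrics, for example rate(prometheus_tsdb_head_samples_appended_total[5m]).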

Key paths and service

Config — /etc/prometheus/prometheus.yml
Data — /var/lib/prometheus/
Rules — /etc/prometheus/rules/*.yml
Service — prometheus (systemd)
Grafana — separate service (often port 3000); add Prometheus URL as data source

Learning resources

Prometheus documentation — prometheus.io/docs
PromQL — Querying basics
Grafana Prometheus data source — grafana.com
Node exporter — node_exporter
Alertmanager — Alertmanager docs

Practice scenarios

Hands-on Prometheus scenarios on live Linux VMs: prometheus
