Prometheus guide
What Prometheus does in production
Prometheus is a monitoring system and time-series database. It answers questions like “what is the error rate right now?”, “which instances are out of disk?”, and “did latency spike after the deploy?” Applications expose metrics in a text format; Prometheus pulls them periodically and stores samples with labels. Alerts fire from rules; humans explore data in the UI or — most often — in Grafana.
Prometheus and Grafana
Prometheus collects, stores, and queries metrics (PromQL). Grafana connects to Prometheus as a data source and builds dashboards, graphs, and tables for visualization. Typical stack: exporters and apps → Prometheus (scrape, store, evaluate rules) → Grafana (dashboards), with Prometheus sending firing alerts to Alertmanager (notifications). When a graph is missing in Grafana, verify the Prometheus query and scrape health first, not only the panel config.
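One way to check the Prometheus side without going through Grafana is to run the query directly against the Prometheus HTTP API. The sketch below is a minimal illustration, assuming Prometheus is reachable at http://localhost:9090 (an assumption, adjust to your setup). It queries the built-in up series, which is 1 for a healthy scrape and 0 for a failed one. The function names are hypothetical helpers, not part of any library.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def down_targets(response: dict) -> list[str]:
    """Return the instance label of every series in an `up` result whose value is 0."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the sample value arrives as a string
        if float(value) == 0.0:
            down.append(series["metric"].get("instance", "<no instance label>"))
    return down


def check(base_url: str = "http://localhost:9090") -> list[str]:
    """Query a live server. The default base_url is an assumption."""
    url = instant_query_url(base_url, "up")
    with urllib.request.urlopen(url, timeout=5) as resp:
        return down_targets(json.load(resp))
```

If check() returns an empty list but the Grafana panel is still blank, the problem is more likely in the panel query or the data source settings than in scraping.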
Pull model and exporters
Prometheus scrapes /metrics (or custom paths) on an interval
(scrape_interval). Targets can be:
- Instrumented apps: Prometheus client libraries expose metrics
- Exporters: separate processes (often sidecars) that translate stats to Prometheus format (e.g. node_exporter for Linux host metrics)
- Pushgateway: for short-lived batch jobs that cannot be scraped (the job pushes, then Prometheus pulls from the gateway)
Prometheus itself listens on port 9090 by default. Check target health
at Status → Targets in the Prometheus UI.
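To make the pull model concrete, here is a minimal sketch of an app that exposes a /metrics endpoint using only the Python standard library. A real service would normally use an official client library instead; the metric name demo_requests_total, the port 8000, and the helper names are all hypothetical choices for this illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

_lock = threading.Lock()
_requests_total = 0  # hypothetical counter: non-metrics requests served so far


def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    with _lock:
        count = _requests_total
    return (
        "# HELP demo_requests_total Requests handled by the demo app.\n"
        "# TYPE demo_requests_total counter\n"
        f"demo_requests_total {count}\n"
    )


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        global _requests_total
        if self.path == "/metrics":
            # Prometheus fetches this path on every scrape.
            body = render_metrics().encode("utf-8")
            content_type = "text/plain; version=0.0.4; charset=utf-8"
        else:
            # Any other path counts as application traffic.
            with _lock:
                _requests_total += 1
            body = b"hello\n"
            content_type = "text/plain; charset=utf-8"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def make_server(port: int = 8000) -> HTTPServer:
    """Create the demo server. Call serve_forever() on the result to run it."""
    return HTTPServer(("127.0.0.1", port), Handler)
```

Running make_server().serve_forever() and adding 127.0.0.1:8000 as a target should make the series appear under Status → Targets once the first scrape completes.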
Metrics format and labels
Metrics are named counters, gauges, histograms, or summaries, with
labels for dimensions:
http_requests_total{method="GET",status="200"}. Each distinct combination of
label values is a separate time series, so labels enable aggregation in PromQL
but inflate storage and memory when cardinality is too high (e.g. user
ID as a label). Prefer bounded label sets.
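The cardinality concern can be put into numbers. The sketch below formats one sample line with labels and computes a worst-case series count for a metric as the product of the number of distinct values per label. The helper names and the label counts in the usage note are illustrative assumptions.

```python
import math


def format_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample line in the Prometheus text format."""

    def escape(label_value: str) -> str:
        # Label values escape backslash, double quote, and newline.
        return (
            label_value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )

    if not labels:
        return f"{name} {value}"
    pairs = ",".join(f'{key}="{escape(val)}"' for key, val in labels.items())
    return f"{name}{{{pairs}}} {value}"


def max_series(distinct_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return math.prod(distinct_values.values())
```

With 5 methods and 10 status codes, max_series({"method": 5, "status": 10}) gives 50 series. Adding a user_id label with 100,000 values raises the bound to 5,000,000, which is why unbounded labels are avoided.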
Configuration overview
Main file: prometheus.yml (often /etc/prometheus/prometheus.yml).
- scrape_configs: jobs, targets, intervals, relabeling
- rule_files: alerting and recording rules
- alerting: Alertmanager endpoints
- remote_write / remote_read: long-term storage integrations (optional)
Reload config without restart:
curl -X POST localhost:9090/-/reload (requires the
--web.enable-lifecycle flag), or systemctl reload prometheus where the unit file defines a reload action.
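As a rough illustration of how these pieces fit together, the sketch below renders a minimal prometheus.yml from a template and builds the HTTP request for the reload endpoint. The job name, the target localhost:9100, and the 15s interval are assumptions for the example. The helper names are hypothetical, and the template covers only a small part of the configuration format.

```python
import urllib.request

# Minimal template: one global interval, one rule glob, one static scrape job.
CONFIG_TEMPLATE = """\
global:
  scrape_interval: {interval}

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: {job}
    static_configs:
      - targets: [{targets}]
"""


def render_config(job: str, targets: list[str], interval: str = "15s") -> str:
    """Fill the template with one job and its static targets."""
    quoted = ", ".join(f'"{target}"' for target in targets)
    return CONFIG_TEMPLATE.format(interval=interval, job=job, targets=quoted)


def reload_request(base_url: str = "http://localhost:9090") -> urllib.request.Request:
    """Build (but do not send) the POST request for the reload endpoint."""
    return urllib.request.Request(f"{base_url.rstrip('/')}/-/reload", method="POST")
```

After writing render_config("node", ["localhost:9100"]) to the config path, it is worth validating the file with promtool check config before sending the request with urllib.request.urlopen(reload_request()). The reload only succeeds if the lifecycle API is enabled.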
PromQL essentials
PromQL queries time series. Examples: rate(http_requests_total[5m])
for the per-second average rate of increase over the last 5 minutes; node_memory_MemAvailable_bytes
for a gauge. Use Grafana’s Explore view to prototype PromQL before adding to
dashboards. Recording rules precompute expensive queries.
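To build intuition for what rate() computes, here is a simplified sketch in Python. It sums the increase between consecutive counter samples, treats any decrease as a counter reset, and divides by the elapsed time. This is not Prometheus's exact algorithm: the real rate() also extrapolates to the edges of the window. The function name and the sample data are illustrative.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second average increase of a counter over (timestamp, value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current < previous:
            # Counter reset (e.g. process restart): the counter restarted from zero.
            increase += current
        else:
            increase += current - previous
    return increase / elapsed
```

For samples taken 60 seconds apart with values 100, 160, and 220, the result is 1.0 requests per second. If the middle sample drops to 10 because the process restarted, the reset is absorbed instead of producing a negative rate.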
Alerting
Alerting rules in Prometheus evaluate PromQL expressions; firing alerts go to Alertmanager for grouping, silencing, and routing (PagerDuty, Slack, email). Keep alerts actionable — “disk full in 4h” beats “cpu > 50%” with no context.
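A "disk full in 4h" alert is usually written with PromQL's predict_linear function, for example predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0. The Python sketch below shows the underlying idea: a least-squares line fitted through recent samples and extended forward. It is a simplification that extrapolates from the last sample rather than from the rule evaluation time, and the function names are hypothetical.

```python
def extrapolate(samples: list[tuple[float, float]], seconds_ahead: float) -> float:
    """Fit a least-squares line to (timestamp, value) samples and extend it forward."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        raise ValueError("samples must not share a single timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / spread
    intercept = mean_v - slope * mean_t
    # Simplification: predict relative to the last sample's timestamp.
    return intercept + slope * (samples[-1][0] + seconds_ahead)


def disk_fills_within(free_bytes: list[tuple[float, float]], seconds_ahead: float) -> bool:
    """True if the fitted trend reaches zero free bytes within the horizon."""
    return extrapolate(free_bytes, seconds_ahead) <= 0
```

A disk losing 1 GB per hour with 2 GB left trips a 4-hour horizon, while a disk with flat usage does not, regardless of how full it currently is. That is what makes the alert actionable compared with a static threshold.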
Service discovery
Static target lists work for small setups. Production often uses
Kubernetes SD, file_sd, or cloud SD so new pods and instances
are scraped automatically. Relabeling drops or rewrites targets and labels before ingestion.
See the Kubernetes lab when pod targets stay down.
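file_sd is the easiest discovery mechanism to script: Prometheus reads target groups from JSON or YAML files and picks up changes without a reload. The sketch below generates such a file's contents from a host list. The hosts, the port 9100, and the env label are assumptions for the example, and file_sd_targets is a hypothetical helper.

```python
import json


def file_sd_targets(hosts: list[str], port: int, labels: dict[str, str]) -> str:
    """Return file_sd JSON: a list of target groups, each with targets and labels."""
    group = {
        "targets": [f"{host}:{port}" for host in hosts],
        "labels": labels,
    }
    return json.dumps([group], indent=2)
```

The output of file_sd_targets(["10.0.0.1", "10.0.0.2"], 9100, {"env": "prod"}) would be written to a file matched by the files pattern of a file_sd_configs entry, ideally via a temporary file and a rename so Prometheus never reads a half-written file.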
Storage and retention
Prometheus stores blocks on local disk (TSDB). Retention defaults to 15 days and is
controlled by --storage.tsdb.retention.time. High-cardinality metrics
or long retention fill disk fast, so monitor Prometheus's own metrics and disk usage.
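For capacity planning, the Prometheus documentation gives a rough formula: needed disk space ≈ retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies it. The default of 2 bytes per sample is a conservative assumption, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86400


def estimated_disk_bytes(
    retention_days: float,
    samples_per_second: float,
    bytes_per_sample: float = 2.0,  # assumption: upper end of the typical 1-2 byte range
) -> float:
    """Rough TSDB size estimate: retention seconds x ingest rate x bytes per sample."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample
```

At 10,000 samples per second and the default 15-day retention, this gives about 25.9 GB. The ingest rate can be read from Prometheus itself with rate(prometheus_tsdb_head_samples_appended_total[5m]).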
Key paths and service
- Config: /etc/prometheus/prometheus.yml
- Data: /var/lib/prometheus/
- Rules: /etc/prometheus/rules/*.yml
- Service: prometheus (systemd)
- Grafana: separate service (often port 3000); add the Prometheus URL as a data source
Learning resources
- Prometheus documentation — prometheus.io/docs
- PromQL — Querying basics
- Grafana — Prometheus data source — grafana.com — Prometheus
- Node exporter — node_exporter
- Alertmanager — Alertmanager docs
Practice scenarios
Hands-on Prometheus scenarios on live Linux VMs: prometheus