Envoy troubleshooting

503 Service Unavailable / no healthy upstream

Access log flag UH — cluster has no healthy endpoints. Check curl localhost:9901/clusters for health_flags and host counts. Causes: all backends down, active health checks failing, wrong ports, or endpoints not registered (EDS stale). Verify backend pods with kubectl get endpoints. Fix health check path/port to match the app.

Upstream connection failure (UF)

Envoy cannot TCP-connect to the chosen host. Connection refused, wrong IP, network policy blocking sidecar egress, or mTLS mismatch in mesh. From the Envoy container: curl -v http://backend:8080. Check cluster endpoints in /config_dump. Istio: verify DestinationRule and Service port names (http vs grpc protocol).

404 / NR (no route configured)

Request reached Envoy but no route matched host or path. Inspect routes in config_dump under route_configs. Virtual host domains must include the request Host header. Common in mesh: missing VirtualService or wrong gateway binding. Test with explicit Host: curl -H "Host: api.example.com" http://localhost:10000/path.

TLS / certificate errors

Downstream (client→Envoy) or upstream (Envoy→backend) TLS failures. Check SDS secrets in config_dump, cert expiry, SNI hostname, and whether upstream expects plain HTTP while Envoy sends TLS (or vice versa). Istio PeerAuthentication and DestinationRule TLS modes (DISABLE, SIMPLE, MUTUAL) must align.

Envoy won't start / config rejected

Run envoy --mode validate -c envoy.yaml before restart. Invalid YAML, wrong @type URLs for filter versions, or missing cluster references fail bootstrap. Read container logs: kubectl logs POD -c istio-proxy. Version skew between Envoy build and config API version breaks typed_config.

Stale endpoints after scale-up/down

Envoy still targets old IPs. xDS should push EDS updates — if stale, control plane issue or disconnected ADS stream. Check istioctl proxy-status for STALE or not synced. Restarting the Envoy pod forces fresh config. For static DNS clusters, ensure TTL/refresh respects DNS changes (STRICT_DNS vs LOGICAL_DNS).

Circuit breaking / UO (upstream overflow)

Too many connections or pending requests to a cluster — circuit breaker open. Stats: upstream_rq_pending_overflow, cluster.rq_retry_overflow. Raise limits or fix slow backends. May indicate retry storms — reduce per-route retry count.

High latency or timeouts

Compare upstream_rq_time stats and access log durations. Check route timeouts vs application latency. Retries doubling load show as duplicate upstream requests. Enable debug logging temporarily: curl -X POST localhost:9901/logging?http=debug (revert after).

Admin port not reachable

Admin binds 127.0.0.1 by default — only local. Use kubectl port-forward or exec curl localhost:9901. Do not expose admin publicly without auth — it allows config dump and shutdown. Istio: admin on 15000 on the sidecar loopback.

Debugging workflow

1. Admin health snapshot

curl -s localhost:9901/server_info
curl -s localhost:9901/clusters | head -80
curl -s localhost:9901/stats | grep -E 'rq_5xx|upstream_cx_connect_fail'

2. Config and routes

curl -s localhost:9901/config_dump | jq '.configs[1].dynamic_route_configs'
# Or save full dump: curl -s localhost:9901/config_dump > /tmp/envoy-config.json

3. Request test + access log

curl -v http://localhost:10000/your-path
# Read response flags (UF, UH, NR) in access log line

Practice scenarios

Hands-on Envoy scenarios on live Linux VMs: envoy

Cheatsheet →