Envoy troubleshooting
503 Service Unavailable / no healthy upstream
Access log flag UH — cluster has no healthy endpoints. Check
curl localhost:9901/clusters for health_flags and
host counts. Causes: all backends down, active health checks failing, wrong
ports, or endpoints not registered (EDS stale). Verify backend pods with
kubectl get endpoints. Fix health check path/port to match the app.
Upstream connection failure (UF)
Envoy cannot TCP-connect to the chosen host. Connection refused, wrong IP,
network policy blocking sidecar egress, or mTLS mismatch in mesh. From the
Envoy container: curl -v http://backend:8080. Check cluster
endpoints in /config_dump. Istio: verify DestinationRule and
Service port names (http vs grpc protocol).
404 / NR (no route configured)
Request reached Envoy but no route matched host or path. Inspect routes in
config_dump under route_configs. Virtual host
domains must include the request Host header. Common in mesh:
missing VirtualService or wrong gateway binding. Test with explicit Host:
curl -H "Host: api.example.com" http://localhost:10000/path.
TLS / certificate errors
Downstream (client→Envoy) or upstream (Envoy→backend) TLS failures. Check
SDS secrets in config_dump, cert expiry, SNI hostname, and whether
upstream expects plain HTTP while Envoy sends TLS (or vice versa). Istio
PeerAuthentication and DestinationRule TLS modes
(DISABLE, SIMPLE, MUTUAL) must align.
Envoy won't start / config rejected
Run envoy --mode validate -c envoy.yaml before restart. Invalid
YAML, wrong @type URLs for filter versions, or missing cluster
references fail bootstrap. Read container logs:
kubectl logs POD -c istio-proxy. Version skew between Envoy build
and config API version breaks typed_config.
Stale endpoints after scale-up/down
Envoy still targets old IPs. xDS should push EDS updates — if stale, control
plane issue or disconnected ADS stream. Check
istioctl proxy-status for STALE or not synced.
Restarting the Envoy pod forces fresh config. For static DNS clusters, ensure
TTL/refresh respects DNS changes (STRICT_DNS vs LOGICAL_DNS).
Circuit breaking / UO (upstream overflow)
Too many connections or pending requests to a cluster — circuit breaker open.
Stats: upstream_rq_pending_overflow,
cluster.rq_retry_overflow. Raise limits or fix slow backends.
May indicate retry storms — reduce per-route retry count.
High latency or timeouts
Compare upstream_rq_time stats and access log durations. Check
route timeouts vs application latency. Retries doubling load show as duplicate
upstream requests. Enable debug logging temporarily:
curl -X POST localhost:9901/logging?http=debug (revert after).
Admin port not reachable
Admin binds 127.0.0.1 by default — only local. Use
kubectl port-forward or exec curl localhost:9901.
Do not expose admin publicly without auth — it allows config dump and shutdown.
Istio: admin on 15000 on the sidecar loopback.
Debugging workflow
1. Admin health snapshot
curl -s localhost:9901/server_info
curl -s localhost:9901/clusters | head -80
curl -s localhost:9901/stats | grep -E 'rq_5xx|upstream_cx_connect_fail'2. Config and routes
curl -s localhost:9901/config_dump | jq '.configs[1].dynamic_route_configs'
# Or save full dump: curl -s localhost:9901/config_dump > /tmp/envoy-config.json3. Request test + access log
curl -v http://localhost:10000/your-path
# Read response flags (UF, UH, NR) in access log linePractice scenarios
Hands-on Envoy scenarios on live Linux VMs: envoy