etcd troubleshooting

Connection refused / endpoint unhealthy

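etcd is not running, or the client is using the wrong endpoint or TLS settings. Confirm that something is listening on the client port with ss -tlnp | grep 2379, then read the member logs. On Kubernetes control planes, inspect the etcd static pod: crictl logs $(crictl ps --name etcd -q). Verify that ETCDCTL_ENDPOINTS points at the intended member and that the CA, certificate, and key paths match the server's TLS configuration. A single unhealthy member does not take the cluster down as long as the remaining members still form a quorum.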
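
The checks above can be sketched as a small shell snippet. The helper name etcd_listening is hypothetical, and the certificate paths are the kubeadm defaults, which is an assumption about the installation; adjust them to wherever your etcd certificates live.

```shell
# Assumed kubeadm default locations; adjust for other installs.
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# Hypothetical helper: succeeds if a TCP listener exists on the given port.
# Column 4 of "ss -tln" is the local address:port.
etcd_listening() {
  port="${1:-2379}"
  ss -tln | awk -v p=":$port" '$4 ~ p"$" { found = 1 } END { exit !found }'
}

# Usage:
#   etcd_listening 2379 && etcdctl endpoint health
```

If the port is not listening, the problem is on the server side (process down, wrong listen address); if it is listening but etcdctl still fails, the endpoint URL or TLS files are the more likely cause.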

No quorum / cluster unavailable

A majority of members (quorum) must be up for writes and linearizable reads. A 3-node cluster with 2 members down has lost quorum: the surviving member can at best serve possibly stale serializable reads, and in practice the cluster is unavailable. Restore the failed nodes or replace the members; do not start isolated members with stale data, which risks split brain. Check etcdctl endpoint health --cluster and etcdctl member list. Kubernetes symptoms: API server timeouts, kubectl hangs, controllers stop reconciling.
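
The majority rule is simple arithmetic, shown here as a minimal sketch. The function names quorum_size and fault_tolerance are made up for illustration.

```shell
# Members required for quorum in an n-member cluster: floor(n/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while the cluster keeps quorum.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

An even-sized cluster tolerates no more failures than the odd size below it (4 members still tolerate only 1), which is why odd cluster sizes are the usual choice.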

Leader election / high latency

Frequent leader changes indicate network issues, disk latency, or CPU starvation. Run etcdctl endpoint status -w table and note which member is leader and the RAFT TERM; a term that keeps increasing means elections keep happening. Check disk I/O on the etcd data directory (WAL fsync is latency-sensitive). On Kubernetes, etcd should run on fast SSDs with dedicated resources; avoid sharing its disk with heavy workloads.
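
One way to spot repeated elections is to sample the raft term twice and compare, as in this sketch. The helper names are hypothetical, and it assumes etcdctl emits compact JSON containing a "raft_term" field in the response header; it uses grep rather than a JSON parser to avoid extra dependencies.

```shell
# Hypothetical helper: print the raft term reported by the first endpoint.
current_raft_term() {
  etcdctl endpoint status -w json | grep -o '"raft_term":[0-9]*' | head -n 1 | cut -d: -f2
}

# Hypothetical helper: succeed if the term changed during the interval
# (seconds), meaning at least one election was started in that window.
raft_term_changed() {
  interval="${1:-60}"
  before=$(current_raft_term)
  sleep "$interval"
  after=$(current_raft_term)
  echo "raft term: $before -> $after"
  [ "$before" != "$after" ]
}

# Usage:
#   raft_term_changed 300 && echo "leader election during the last 5 minutes"
```

A stable cluster keeps the same term for long periods, so a change within a few minutes is worth correlating with disk and network metrics from the same window.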

NOSPACE alarm / database size limit

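etcd stops accepting writes when the data disk is full or the backend quota (--quota-backend-bytes) is exceeded; exceeding the quota raises the NOSPACE alarm. List alarms with etcdctl alarm list. Free disk space under --data-dir, compact old revisions (etcdctl compact <revision>), defragment (etcdctl defrag), then clear the alarm with etcdctl alarm disarm. The alarm does not clear on its own, so writes stay blocked until it is disarmed. For Kubernetes, a misconfigured apiserver compaction interval can let etcd grow; check --etcd-compaction-interval on kube-apiserver.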
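
The compact, defrag, disarm sequence can be wrapped as follows. This is a sketch: the function name recover_nospace is hypothetical, and it compacts to the current revision, which discards all older key history; pick an earlier revision if history matters to you.

```shell
# Hypothetical helper wrapping the compact / defrag / disarm sequence.
recover_nospace() {
  # Current store revision, parsed from the JSON status of the first endpoint.
  rev=$(etcdctl endpoint status -w json | grep -o '"revision":[0-9]*' | head -n 1 | cut -d: -f2)
  [ -n "$rev" ] || { echo "could not read current revision" >&2; return 1; }

  etcdctl compact "$rev" &&   # discard history older than the current revision
  etcdctl defrag &&           # release the freed space back to the filesystem
  etcdctl alarm disarm        # clear the alarm so writes resume
}
```

etcdctl defrag acts on the endpoints configured for the client, and a member is blocked while it defragments, so running it against one member at a time is the cautious choice.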

CORRUPT alarm / data corruption

Serious: the member's data is corrupt (WAL or snapshot) or no longer consistent with the rest of the cluster. Stop the affected member, then either restore from a recent backup taken with etcdctl snapshot save or, if the other members are healthy, replace the member and let it sync from the cluster. Do not delete data directories on multiple members at once. For Kubernetes, follow the official disaster-recovery docs before taking destructive steps.
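
For the replace-the-member path, a dry-run sketch like the one below prints the commands in order without executing anything. The function name, the default data directory /var/lib/etcd, and the .corrupt suffix are assumptions for illustration; the member ID comes from etcdctl member list.

```shell
# Hypothetical dry-run helper: prints the commands, executes nothing.
# Arguments: member ID, member name, peer URL, data dir (optional).
print_member_replace_plan() {
  id="$1" name="$2" peer_url="$3" data_dir="${4:-/var/lib/etcd}"
  cat <<EOF
# 1. On a healthy member: remove the corrupt member from the cluster
etcdctl member remove $id
# 2. On the affected node, with etcd stopped: move the old data aside
mv $data_dir $data_dir.corrupt
# 3. On a healthy member: register the replacement
etcdctl member add $name --peer-urls=$peer_url
# 4. On the affected node: start etcd with --initial-cluster-state=existing
EOF
}

# Usage:
#   print_member_replace_plan 8e9e05c52164694d etcd-2 https://10.0.0.12:2380
```

Moving the data directory aside instead of deleting it keeps the damaged files available for later inspection.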

TLS / certificate errors

Errors such as x509: certificate signed by unknown authority, or expired certificates, block clients. kubeadm renews certificates during control plane upgrades; mismatched apiserver and etcd certificates break the control plane. Verify expiry with openssl x509 -in server.crt -noout -dates. Ensure ETCDCTL_CACERT, ETCDCTL_CERT, and ETCDCTL_KEY match the server's trust chain.
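
To catch certificates before they expire, openssl's -checkend option can be scripted. A minimal sketch, with a hypothetical function name and an arbitrary 30-day default threshold:

```shell
# Hypothetical helper: succeed only if the certificate is still valid
# for at least the given number of days (default 30).
cert_valid_for_days() {
  cert="$1" days="${2:-30}"
  openssl x509 -in "$cert" -noout -checkend $(( days * 86400 )) >/dev/null
}

# Usage (path is an assumed kubeadm default):
#   cert_valid_for_days /etc/kubernetes/pki/etcd/server.crt 30 \
#     || echo "etcd server certificate expires within 30 days"
```

The same check applies to the client certificate the apiserver presents to etcd, since an expired certificate on either side breaks the connection.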

Kubernetes API server cannot reach etcd

Check --etcd-servers and the TLS flags (--etcd-cafile, --etcd-certfile, --etcd-keyfile) in /etc/kubernetes/manifests/kube-apiserver.yaml. The apiserver logs show etcd connection errors. etcd must be reachable at the URLs listed, often https://127.0.0.1:2379 for stacked etcd. If etcd is healthy but the apiserver is not, the problem is usually certificates or a wrong endpoint in the manifest.
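
Pulling the etcd-related flags out of the manifest makes it easy to compare them with what etcd actually serves. A sketch, with a hypothetical function name:

```shell
# Hypothetical helper: print the etcd-related flags from a kube-apiserver manifest.
apiserver_etcd_flags() {
  manifest="${1:-/etc/kubernetes/manifests/kube-apiserver.yaml}"
  grep -oE -- '--etcd-(servers|cafile|certfile|keyfile)=[^[:space:]]+' "$manifest"
}

# Usage: probe the listed URL with the same files the apiserver uses, e.g.
#   etcdctl --endpoints=<etcd-servers value> --cacert=<etcd-cafile value> \
#     --cert=<etcd-certfile value> --key=<etcd-keyfile value> endpoint health
```

If that etcdctl call succeeds with the apiserver's own certificate and key, the manifest values are consistent and the fault lies elsewhere.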

Member failed to join cluster

New members must be started with --initial-cluster-state=existing and with peer URLs reachable on port 2380. Register the member with etcdctl member add before starting the new process. Hostnames in the member list must resolve from every node; avoid localhost in multi-host clusters. Firewall rules must allow 2379 (client) and 2380 (peer).
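
The registration step can be guarded against the loopback mistake mentioned above. A sketch with a hypothetical wrapper name; the member name and address in the usage line are placeholders:

```shell
# Hypothetical helper: register a new member, refusing loopback peer URLs.
add_member() {
  name="$1" peer_url="$2"
  case "$peer_url" in
    *://localhost:*|*://127.*|*://\[::1\]:*)
      echo "peer URL $peer_url is not reachable from other hosts" >&2
      return 1 ;;
  esac
  # On success etcdctl prints ETCD_NAME, ETCD_INITIAL_CLUSTER and
  # ETCD_INITIAL_CLUSTER_STATE="existing" for starting the new process.
  etcdctl member add "$name" --peer-urls="$peer_url"
}

# Usage:
#   add_member etcd-4 https://10.0.0.14:2380
```

Start the new etcd process only after this command succeeds, using the values it prints.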

Debugging workflow

1. Cluster health

export ETCDCTL_API=3
etcdctl endpoint health --cluster
etcdctl endpoint status -w table
etcdctl alarm list

2. Disk and size

df -h /var/lib/etcd
etcdctl endpoint status -w json | jq '.[].Status.dbSize'

3. Backup before changes

SNAP=/backup/etcd-$(date +%F).db   # compute the name once so both commands use the same file
etcdctl snapshot save "$SNAP"
etcdctl snapshot status "$SNAP" -w table
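
To judge how close the database size reported in step 2 is to the NOSPACE threshold, compare it with the backend quota. A sketch with a hypothetical function name; it assumes the 2 GiB default quota unless a second argument is given.

```shell
# Percentage of the backend quota used by a database of the given size (bytes).
# Second argument is the quota in bytes; 2147483648 (2 GiB) is the assumed
# default when --quota-backend-bytes is not set.
quota_used_pct() {
  db_size="$1" quota="${2:-2147483648}"
  echo $(( db_size * 100 / quota ))
}

# Usage:
#   quota_used_pct 1073741824              # 1 GiB against the default quota
#   quota_used_pct 1073741824 8589934592   # 1 GiB against an 8 GiB quota
```

A value that keeps climbing toward 100 is the cue to check compaction and schedule a defrag before the alarm fires.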
