etcd troubleshooting
Connection refused / endpoint unhealthy
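Either etcd is not running or the client is using the wrong endpoint or TLS
settings. Check that something is listening with ss -tlnp | grep 2379 and read
the member logs. On Kubernetes control planes, inspect the etcd static pod:
crictl logs $(crictl ps --name etcd -q). Verify that ETCDCTL_ENDPOINTS points at
the right address and that the CA, certificate, and key paths belong to the same
trust chain as the server cert. A single unhealthy member does not take the
cluster down as long as the remaining members still form a quorum.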
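Before blaming the server, it can help to rule out a broken client environment. The following is a minimal sketch, not an etcd feature: check_etcdctl_tls_env is a hypothetical helper name, and it assumes TLS is configured through the ETCDCTL_CACERT, ETCDCTL_CERT, and ETCDCTL_KEY environment variables rather than command-line flags.

```shell
# Sketch: report which etcdctl TLS variables are unset or point at unreadable
# files. Exits 0 when all three are usable, 1 otherwise.
# check_etcdctl_tls_env is a hypothetical helper, not part of etcdctl.
check_etcdctl_tls_env() (
  status=0
  for var in ETCDCTL_CACERT ETCDCTL_CERT ETCDCTL_KEY; do
    eval "path=\${$var:-}"   # indirect lookup of the variable named in $var
    if [ -z "$path" ]; then
      echo "$var is not set"
      status=1
    elif [ ! -r "$path" ]; then
      echo "$var points at an unreadable file: $path"
      status=1
    fi
  done
  exit "$status"   # the function body is a subshell, so this only ends the check
)
```

Running check_etcdctl_tls_env && etcdctl endpoint health separates a local configuration mistake from a member that is actually down.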
No quorum / cluster unavailable
A majority of members must be up for writes and linearizable reads. A 3-node
cluster with 2 members down cannot commit writes; at best it serves serializable
(possibly stale) reads from the surviving member, and it is usually fully
unavailable to clients. Restore the failed nodes or replace the members. Do not
start isolated members with stale data, because that risks split brain.
Check etcdctl endpoint health --cluster and
etcdctl member list. Kubernetes symptoms: API server timeouts,
kubectl hangs, and controllers that stop reconciling.
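The majority rule above is plain integer arithmetic, which makes it easy to see why even-sized clusters add no fault tolerance. A small sketch (quorum_size and failure_tolerance are illustrative helper names, not etcdctl commands):

```shell
# quorum_size N: members that must be reachable for an N-member cluster
# to commit writes (a strict majority).
quorum_size() { echo $(( $1 / 2 + 1 )); }

# failure_tolerance N: members that can be lost before quorum is gone.
failure_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 3 4 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(failure_tolerance "$n")"
done
# members=3 quorum=2 tolerated_failures=1
# members=4 quorum=3 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
```

A fourth member raises the quorum to 3 without tolerating any additional failure, which is why odd cluster sizes are the usual choice.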
Leader election / high latency
Frequent leader changes indicate network issues, disk latency, or CPU starvation.
Run etcdctl endpoint status -w table and note which member is the
leader and the RAFT TERM value; a term that keeps climbing between runs means
elections are recurring. Check disk I/O on the etcd data dir (WAL fsync
is latency-sensitive). On Kubernetes, etcd should run on fast SSD with
dedicated resources. Avoid sharing its disk with heavy workloads.
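One way to put a number on "frequent" is to sample the raft term over time and count how often it changes. This is a sketch under assumptions: sample_raft_term and count_term_changes are hypothetical helper names, and the grep pattern assumes the JSON output carries the term as "raft_term":<number> in the response header.

```shell
# Print the raft term reported by the first endpoint.
# Assumption: the JSON output contains "raft_term":<number> with no spaces.
sample_raft_term() {
  etcdctl endpoint status -w json | grep -o '"raft_term":[0-9]*' | head -n 1 | cut -d: -f2
}

# Read one raft term per line (oldest first) on stdin and print how many
# times the value changed, i.e. how many elections happened between samples.
count_term_changes() {
  awk 'NR > 1 && $1 != prev { changes++ } { prev = $1 } END { print changes + 0 }'
}
```

For example, for i in 1 2 3 4 5 6; do sample_raft_term; sleep 10; done | count_term_changes covers roughly a minute; the 10-second interval is an arbitrary choice. A healthy cluster should normally report 0.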
NOSPACE alarm / database size limit
etcd stops accepting writes when the disk is full or the backend quota
(--quota-backend-bytes) is exceeded. List alarms with
etcdctl alarm list. Free disk space on --data-dir,
compact old revisions (etcdctl compact <revision>), defragment each member
(etcdctl defrag), then clear the alarm with etcdctl alarm disarm. For
Kubernetes, a misconfigured apiserver compaction interval can let etcd grow;
check --etcd-compaction-interval on kube-apiserver.
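The compact, defrag, disarm sequence can be sketched as a pair of shell functions. Treat this as an outline under assumptions rather than a drop-in script: extract_revision and recover_from_nospace are hypothetical helper names, the grep pattern assumes the JSON output contains "revision":<number> without spaces, and etcdctl is assumed to be configured already through its ETCDCTL_* environment variables.

```shell
# Read `etcdctl endpoint status -w json` on stdin and print the current
# store revision of the first endpoint.
extract_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | cut -d: -f2
}

# Compact everything older than the current revision, defragment, then
# clear the alarm. Stops at the first failing step.
recover_from_nospace() {
  rev=$(etcdctl endpoint status -w json | extract_revision)
  [ -n "$rev" ] || { echo "could not read current revision" >&2; return 1; }
  etcdctl compact "$rev" &&
  etcdctl defrag &&        # only affects the endpoints etcdctl is pointed at
  etcdctl alarm disarm
}
```

Defragmentation is per member, so the defrag step has to reach every endpoint before the freed space shows up cluster-wide. Take a snapshot first; compaction discards old revisions permanently.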
CORRUPT alarm / data corruption
Serious. This alarm indicates WAL or snapshot corruption. Stop the affected
member, then either restore from a recent etcdctl snapshot save backup or replace
the member and let it sync from the cluster (if the others are healthy). Do not
delete data dirs on multiple members at once. For Kubernetes, follow the official
disaster-recovery docs before taking destructive steps.
TLS / certificate errors
x509: certificate signed by unknown authority errors and expired certs
block clients. kubeadm renews control-plane certs during upgrades, and
mismatched apiserver and etcd certs break the control plane. Verify expiry with
openssl x509 -in server.crt -noout -dates. Ensure
ETCDCTL_CACERT, ETCDCTL_CERT, and ETCDCTL_KEY match
the server's trust chain.
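Reading dates by eye gets tedious across several certificates. As a sketch, openssl's -checkend option can turn the expiry check into an exit code; cert_expires_within is a hypothetical helper name, and the 30-day threshold in the usage note is an arbitrary choice.

```shell
# cert_expires_within FILE DAYS
# Returns 0 if FILE expires within DAYS days, 1 if it stays valid that long,
# 2 if FILE is not readable. A file openssl cannot parse is also reported
# as expiring (return 0), since it is unusable either way.
cert_expires_within() {
  [ -r "$1" ] || return 2
  # -checkend N exits 0 only if the cert is still valid N seconds from now
  if openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}
```

For example, for c in /etc/kubernetes/pki/etcd/*.crt; do cert_expires_within "$c" 30 && echo "renew soon: $c"; done flags anything expiring within 30 days. That path assumes a kubeadm-style layout; adjust it for other installs.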
Kubernetes API server cannot reach etcd
Check --etcd-servers and the TLS flags (--etcd-cafile, --etcd-certfile,
--etcd-keyfile) in /etc/kubernetes/manifests/kube-apiserver.yaml. The apiserver
logs show etcd connection errors. etcd must be reachable at the URLs listed, often
https://127.0.0.1:2379 for stacked etcd. If etcd is healthy but the
apiserver is not, the problem is usually certs or a wrong endpoint in the manifest.
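To see exactly what the apiserver was told about etcd, the relevant flags can be pulled out of the static pod manifest. A minimal sketch, assuming the flags appear in --flag=value form; apiserver_etcd_flags is a hypothetical helper name:

```shell
# Print every --etcd-* flag and its value from a kube-apiserver manifest.
# Defaults to the kubeadm static pod path when no file is given.
apiserver_etcd_flags() {
  grep -o -e '--etcd-[a-z-]*=[^ "]*' "${1:-/etc/kubernetes/manifests/kube-apiserver.yaml}"
}
```

Compare the printed --etcd-servers value with the address etcd is listening on (ss -tlnp | grep 2379), and pass the printed CA, cert, and key paths to openssl x509 to check expiry.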
Member failed to join cluster
A new member must start with --initial-cluster-state=existing, and its
peer URLs must be reachable on port 2380. Register it with
etcdctl member add before starting the new process. Hostnames in
member lists must resolve between all nodes, so avoid localhost for
cross-host clusters. Firewall rules must allow 2379 (client) and 2380 (peer).
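Two small helpers illustrate the points above. This is a sketch under assumptions: members_with_loopback_peers and add_member are hypothetical names, and the awk field positions assume the default comma-separated output of etcdctl member list (ID, status, name, peer URLs, client URLs, learner flag).

```shell
# Read `etcdctl member list` output on stdin and print the names of members
# whose peer URL points at a loopback address (unreachable from other hosts).
members_with_loopback_peers() {
  awk -F', ' '$4 ~ /\/\/(localhost|127\.)/ { print $3 }'
}

# Register a new member before its etcd process is started.
# $1 = member name, $2 = peer URL; both are supplied by the caller.
add_member() {
  etcdctl member add "$1" --peer-urls="$2"
}
```

Running etcdctl member list | members_with_loopback_peers should print nothing on a multi-host cluster. After add_member succeeds, start the new etcd process with --initial-cluster-state=existing.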
Debugging workflow
1. Cluster health
export ETCDCTL_API=3
etcdctl endpoint health --cluster
etcdctl endpoint status -w table
etcdctl alarm list
2. Disk and size
df -h /var/lib/etcd
etcdctl endpoint status -w json | jq '.[].Status.dbSize'
3. Backup before changes
etcdctl snapshot save /backup/etcd-$(date +%F).db
etcdctl snapshot status /backup/etcd-$(date +%F).db -w table