SadServers

etcd guide

What etcd does in production

etcd stores small amounts of critical data with strong consistency guarantees. Unlike a general-purpose database, it is optimized for reliable reads/writes of configuration and metadata — not large documents or analytics workloads.

  • Kubernetes — the API server's only persistent store; cluster state lives here
  • Service discovery — register and watch endpoints for dynamic services
  • Distributed locking — coordinate leaders and exclusive work across nodes
  • Configuration — shared settings that many processes must agree on

etcd and Kubernetes

Every Kubernetes cluster depends on etcd. When you kubectl apply a Deployment, the API server validates the request and writes the object to etcd. Controllers and kubelets watch etcd (via the API server) for changes. If etcd is unavailable or corrupt, the control plane cannot function — existing workloads on nodes may keep running, but scheduling, scaling, and updates stop.

Managed Kubernetes (EKS, GKE, AKS) hides etcd, but self-managed and kubeadm clusters expose it directly. Troubleshooting a broken cluster often means checking etcd health, quorum, and disk. See also the Kubernetes scenarios on SadServers.

How a client connection works

When a client (etcdctl, kube-apiserver, or an app using the etcd client library) connects:

  1. TCP connect — client reaches host:2379 (client port)
  2. TLS / auth — production clusters use mutual TLS and optionally RBAC users
  3. gRPC request — get, put, delete, watch, or transaction against a key prefix
  4. Raft commit — writes go to the leader and replicate to a majority before ack
  5. Response — result returns; watches stream subsequent changes

Reads can be linearizable (the default; the serving member confirms with the leader that its data is current before answering) or serializable (answered from any member's local state, so the result may be slightly stale). Kubernetes requires a healthy, consistent etcd, so treat outages as control-plane emergencies.
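The connection steps and the two read modes above can be sketched as an etcdctl command line assembled in Python. This is a minimal sketch, not a recommended wrapper: the helper name `etcdctl_get`, the endpoint, and the certificate paths are placeholders invented for illustration, while `--endpoints`, `--cacert`, `--cert`, `--key`, and `--consistency` are etcdctl v3 flags.

```python
import shlex

def etcdctl_get(key, endpoint, ca, cert, cert_key, serializable=False):
    """Build (but do not run) an etcdctl v3 command that reads one key."""
    argv = [
        "etcdctl",
        f"--endpoints={endpoint}",  # step 1: client port, normally 2379
        f"--cacert={ca}",           # step 2: CA used to verify the server
        f"--cert={cert}",           # step 2: client certificate (mutual TLS)
        f"--key={cert_key}",        # step 2: client private key
        "get", key,                 # step 3: gRPC range request
    ]
    if serializable:
        # Serializable read: answered from the contacted member's local state.
        argv.append("--consistency=s")
    return argv

# Placeholder values; substitute the paths used by your installation.
cmd = etcdctl_get(
    "/config/app",
    "https://10.0.0.1:2379",
    "/path/to/ca.crt",
    "/path/to/client.crt",
    "/path/to/client.key",
)
print(shlex.join(cmd))
```

Leaving `serializable` at its default keeps the linearizable behavior; passing `serializable=True` trades freshness for a read that does not depend on the leader.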

Key files and configuration

  • /etc/etcd/etcd.conf.yml — common config path (varies by install)
  • --data-dir — WAL and snapshot storage (e.g. /var/lib/etcd/)
  • --listen-client-urls — client API bind (port 2379)
  • --listen-peer-urls — member-to-member Raft traffic (port 2380)
  • --initial-cluster — bootstrap member list for new clusters
  • Certificates — --cert-file, --key-file, --trusted-ca-file for TLS
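To show how these flags fit together, here is a small sketch that assembles the etcd command line for one member of a new static cluster. The member names and IP addresses are made up, the helper names are hypothetical, and the TLS certificate flags are left out for brevity even though `https` listen URLs need them in practice.

```python
def initial_cluster(members):
    """Format {name: peer_url} as the value of --initial-cluster."""
    return ",".join(f"{name}={url}" for name, url in members.items())

def etcd_argv(name, ip, members, data_dir="/var/lib/etcd"):
    """Build the etcd argument list for one member of a brand-new cluster."""
    return [
        "etcd",
        f"--name={name}",
        f"--data-dir={data_dir}",                            # WAL and snapshots
        f"--listen-client-urls=https://{ip}:2379",           # client API
        f"--advertise-client-urls=https://{ip}:2379",
        f"--listen-peer-urls=https://{ip}:2380",             # Raft traffic
        f"--initial-advertise-peer-urls=https://{ip}:2380",
        f"--initial-cluster={initial_cluster(members)}",
        "--initial-cluster-state=new",
    ]

# Hypothetical three-member cluster.
members = {
    "etcd-1": "https://10.0.0.1:2380",
    "etcd-2": "https://10.0.0.2:2380",
    "etcd-3": "https://10.0.0.3:2380",
}
for arg in etcd_argv("etcd-1", "10.0.0.1", members):
    print(arg)
```

Every member is started with the same `--initial-cluster` value; only the name and the URLs that refer to the member itself differ.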

Core concepts

  • Keys and values — byte strings; Kubernetes uses hierarchical paths like /registry/pods/...
  • Revision — cluster-wide monotonic counter; every change increments it
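  • Leases — TTL grants that keys attach to; attached keys are deleted when the lease expires or is revoked. Kubernetes node heartbeats use Lease API objects, which etcd stores as ordinary keys
  • Watches — gRPC streams of change events starting from a chosen revision; this is how controllers react to changes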
  • Transactions — compare-and-swap style atomic multi-op updates
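The relationship between revisions and compare-and-swap transactions can be illustrated with a toy in-memory model. This is not how etcd is implemented and it is not a client library; the class and method names are invented for illustration. It only mirrors two ideas from the list above: every write bumps one cluster-wide revision, and a transaction can refuse to write if the key changed since it was last read.

```python
class ToyStore:
    """Toy model only: one global revision plus a per-key modification revision."""

    def __init__(self):
        self.revision = 0
        self.data = {}  # key -> (value, mod_revision)

    def put(self, key, value):
        self.revision += 1                      # every change increments the revision
        self.data[key] = (value, self.revision)
        return self.revision

    def mod_revision(self, key):
        # A key that was never written reports modification revision 0.
        return self.data.get(key, (None, 0))[1]

    def put_if_unchanged(self, key, expected_mod_revision, value):
        """Compare-and-swap: write only if nobody modified the key in between."""
        if self.mod_revision(key) != expected_mod_revision:
            return False
        self.put(key, value)
        return True

store = ToyStore()
store.put("/config/mode", "blue")             # revision 1
seen = store.mod_revision("/config/mode")     # 1
store.put("/config/other", "x")               # revision 2, different key
ok = store.put_if_unchanged("/config/mode", seen, "green")    # succeeds, revision 3
stale = store.put_if_unchanged("/config/mode", seen, "red")   # fails, key is now at 3
print(ok, stale, store.revision)  # True False 3
```

The second attempt fails because it still carries the modification revision it read earlier, which is the same pattern a client uses to avoid overwriting a concurrent update.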

Cluster and replication (Raft)

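etcd clusters should have an odd number of members (typically 3 or 5). One member is the leader at a time; followers replicate the Raft log. A write succeeds once it is committed on a majority (quorum) of members, which is floor(n/2) + 1 for n members: a 3-node cluster tolerates 1 failure and a 5-node cluster tolerates 2. An even member count raises the quorum size without raising fault tolerance.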
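The quorum arithmetic is easy to check with a few lines of Python. The function names are illustrative; the formula is the usual majority rule for Raft clusters.

```python
def quorum(members):
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1

def tolerated_failures(members):
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)

for n in range(1, 8):
    print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

The output shows why odd sizes are preferred: 3 and 4 members both tolerate one failure, and 5 and 6 both tolerate two, so the extra even member adds replication cost without adding resilience.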

Member ports: clients use 2379; peers communicate on 2380. Each member needs a unique name and reachable peer URL.

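Adding a member: run etcdctl member add (or use the API) against the existing cluster, then start the new node with the --initial-cluster value that the command prints and --initial-cluster-state=existing. Add one member at a time and wait for it to become healthy before adding the next. Removing a member requires quorum, so never force-remove a majority of nodes.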

Monitor with etcdctl endpoint health, etcdctl endpoint status, and etcdctl member list. On kubeadm clusters etcd runs as a static pod on each control-plane node, so these commands are often run with etcdctl inside that pod.

Maintenance

etcd retains a revision history; without compaction, disk usage grows without bound. Run compaction to drop old revisions (the Kubernetes apiserver compacts periodically; standalone clusters need etcdctl compact or etcd's auto-compaction settings). After compaction, defragmentation (etcdctl defrag) reclaims disk space from the bbolt backend; it runs per member and blocks that member while it runs, so defragment one member at a time. Take a snapshot before major maintenance.
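The maintenance order described above (snapshot, compact, defragment) can be sketched as a list of etcdctl commands built in Python. This is a sketch under stated assumptions: the helper names and the snapshot filename are invented, and the revision is extracted with a regular expression on the assumption that the output of `etcdctl endpoint status --write-out=json` contains a `"revision":<n>` field. The sample JSON is a trimmed, illustrative stand-in rather than real output.

```python
import re

def current_revision(status_json):
    """Extract the first "revision" number from endpoint-status JSON text."""
    match = re.search(r'"revision"\s*:\s*(\d+)', status_json)
    if match is None:
        raise ValueError("no revision field found in status output")
    return int(match.group(1))

def maintenance_commands(status_json, endpoint):
    """Return the commands to run, in order, against a single member."""
    rev = current_revision(status_json)
    base = ["etcdctl", f"--endpoints={endpoint}"]
    return [
        base + ["snapshot", "save", "pre-maintenance.db"],  # back up first
        base + ["compact", str(rev)],   # drops all history before this revision
        base + ["defrag"],              # reclaims space on this member only
    ]

# Illustrative, trimmed sample; real output has more fields.
sample = '[{"Endpoint":"127.0.0.1:2379","Status":{"header":{"revision":1042}}}]'
for cmd in maintenance_commands(sample, "https://127.0.0.1:2379"):
    print(" ".join(cmd))
```

Compacting at the current revision discards all older history, which suits a space-recovery procedure; keep a larger window if clients need to resume watches from older revisions.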

A snapshot is a point-in-time backup of cluster state. Schedule regular snapshots; they are the primary disaster-recovery path for etcd and thus for Kubernetes control-plane state.

Backups

Replication is not a backup: a bad write committed to quorum is replicated everywhere. Use etcdctl snapshot save for consistent backups. Restore with etcdctl snapshot restore (etcdutl snapshot restore in etcd 3.5 and later) into a fresh data directory. A restore creates new cluster metadata, so plan member names and the initial cluster configuration carefully, and restore every member from the same snapshot.
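A restore command for one member might be assembled as follows. This is a minimal sketch: the helper name, member name, URLs, token, and paths are placeholders. It uses `etcdutl snapshot restore`, the form used from etcd 3.5 onward; on older releases the same flags apply to `etcdctl snapshot restore`. The existence check mirrors the requirement to restore into a fresh data directory.

```python
import os

def restore_argv(snapshot, name, peer_url, initial_cluster, data_dir,
                 token="restored-cluster"):
    """Build the restore command for one member; refuse to reuse a data dir."""
    if os.path.exists(data_dir):
        raise FileExistsError(f"{data_dir} already exists; restore needs a fresh directory")
    return [
        "etcdutl", "snapshot", "restore", snapshot,
        f"--name={name}",
        f"--initial-cluster={initial_cluster}",
        f"--initial-cluster-token={token}",       # new token for the restored cluster
        f"--initial-advertise-peer-urls={peer_url}",
        f"--data-dir={data_dir}",
    ]

# Placeholder values for a hypothetical single-member restore.
cmd = restore_argv(
    snapshot="backup.db",
    name="etcd-1",
    peer_url="https://10.0.0.1:2380",
    initial_cluster="etcd-1=https://10.0.0.1:2380",
    data_dir="/var/lib/etcd-restored-example",
)
print(" ".join(cmd))
```

For a multi-member cluster, the same snapshot file is restored once per member, each with its own name, peer URL, and data directory but an identical `--initial-cluster` value.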

For Kubernetes: back up etcd before control-plane upgrades. kubeadm documents snapshot procedures for stacked etcd. Test restore on a staging cluster — an untested backup is not a backup.

Learning resources

  • etcd documentation — etcd.io/docs
  • Operations guide — etcd.io — operations
  • Kubernetes etcd — kubernetes.io — configure and upgrade etcd
  • Raft consensus — raft.github.io

Practice scenarios

Hands-on etcd scenarios on live Linux VMs: etcd

Updated: 2026-06-13 16:06 UTC – 2d2950a