Kubernetes troubleshooting
Pod stuck in Pending
The scheduler could not place the pod. Run kubectl describe pod POD -n NS
and read the Events section. Common causes: insufficient allocatable CPU or
memory on every node (the cluster is full), no node matching the nodeSelector
or affinity rules, an unbound PVC, or taints without matching tolerations.
Check kubectl get nodes and
kubectl describe node NODE for allocatable resources.
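The checks above can be wrapped in a few small helpers. This is a minimal sketch, not a definitive procedure: the function names are hypothetical, and POD, NS, and NODE are placeholders passed as arguments.

```shell
# Sketch only: function names are hypothetical; arguments are placeholders.

# List every pod the scheduler has not placed yet, across all namespaces.
pending_pods() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
}

# Show only the scheduler's failure events for one pod.
scheduling_events() {  # usage: scheduling_events POD NS
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling"
}

# Print the node's allocatable capacity and what is already requested.
node_capacity() {  # usage: node_capacity NODE
  kubectl describe node "$1" | grep -A 10 -E '^(Allocatable|Allocated resources):'
}
```

The FailedScheduling event message usually names the reason directly (for example, how many nodes lacked memory or carried an untolerated taint).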
ImagePullBackOff / ErrImagePull
The kubelet cannot pull the container image. Typical causes: a wrong image name
or tag, a private registry without imagePullSecrets, a registry outage, or
registry rate limits. Verify the image name in the pod spec, then test registry
access from a node or a debug pod. For private registries, create
a Secret and reference it in the ServiceAccount or pod spec.
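For the private-registry case, the Secret and ServiceAccount steps might look like the sketch below. The Secret name regcred, the host registry.example.com, and the function name are hypothetical; substitute your own registry and credentials.

```shell
# Sketch only: "regcred", registry.example.com, and the function name are hypothetical.
add_pull_secret() {  # usage: add_pull_secret NS USER PASSWORD
  # Store the registry credentials as a docker-registry Secret.
  kubectl create secret docker-registry regcred -n "$1" \
    --docker-server=registry.example.com \
    --docker-username="$2" --docker-password="$3" || return 1
  # Attach the Secret to the default ServiceAccount so new pods use it.
  kubectl patch serviceaccount default -n "$1" \
    -p '{"imagePullSecrets":[{"name":"regcred"}]}'
}
```

The ServiceAccount change only affects pods created afterwards, so existing pods need to be recreated. Passing a password on the command line also leaves it in shell history.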
CrashLoopBackOff
Container starts, exits non-zero, kubelet restarts with backoff. Check
kubectl logs POD -n NS --previous for the crashed instance.
Common causes: app misconfig, missing env/Secret, wrong command, port already
in use inside container, failed migrations. Use
kubectl describe pod for restart count and exit codes.
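Restart counts and exit codes can also be pulled out directly with a jsonpath query instead of reading the full describe output. A minimal sketch, with a hypothetical function name and placeholder arguments:

```shell
# Sketch only: the function name is hypothetical; POD and NS are placeholders.
restart_summary() {  # usage: restart_summary POD NS
  # One line per container: name, restart count, last exit code, last reason.
  kubectl get pod "$1" -n "$2" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.exitCode}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}
```

As a rough guide, exit code 137 is 128 + 9 (SIGKILL), which is what an OOM kill produces, and 127 usually means the configured command was not found in the image.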
OOMKilled
Container exceeded its memory limit. kubectl describe pod shows
Last State: Terminated, Reason: OOMKilled. Raise
resources.limits.memory or fix the leak. Memory pressure on the node can also
get pods evicted even when they stay below their own limit; check
kubectl get events and the node's MemoryPressure condition.
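Raising the limit and checking actual usage could be sketched as follows. The function names are hypothetical, the example assumes the workload is a Deployment, and kubectl top only works when metrics-server is installed in the cluster.

```shell
# Sketch only: function names are hypothetical; assumes a Deployment and metrics-server.

# Raise the memory limit on all containers of a Deployment (triggers a rollout).
raise_memory_limit() {  # usage: raise_memory_limit DEPLOYMENT NS LIMIT  (e.g. 512Mi)
  kubectl set resources deployment "$1" -n "$2" --limits="memory=$3"
}

# Show current memory use per container to compare against the limit.
container_memory() {  # usage: container_memory POD NS
  kubectl top pod "$1" -n "$2" --containers
}
```

If the Deployment is managed by Helm or GitOps, change the limit in the source manifest instead, or the next sync will revert it.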
Service not reachable
Trace the path. First confirm that the Service selector matches the pod labels:
kubectl get endpoints SERVICE -n NS should list pod IPs.
Then test from another pod (no trailing dot on the hostname, or the cluster search domains are skipped):
kubectl run tmp --rm -it --restart=Never --image=busybox -- wget -qO- http://SERVICE.NS.svc
For Ingress issues, see the Traefik or
Nginx labs. Also check for NetworkPolicies blocking traffic.
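The selector, label, endpoint, and NetworkPolicy checks can be run in one pass. A minimal sketch with a hypothetical function name; SERVICE and NS are placeholders:

```shell
# Sketch only: the function name is hypothetical; arguments are placeholders.
service_backends() {  # usage: service_backends SERVICE NS
  # The label selector the Service uses to pick pods.
  kubectl get service "$1" -n "$2" -o jsonpath='{.spec.selector}{"\n"}'
  # Pods and their labels, for comparison with the selector above.
  kubectl get pods -n "$2" --show-labels
  # Pod IPs behind the Service; an empty list means no Ready pod matched.
  kubectl get endpoints "$1" -n "$2"
  # NetworkPolicies that could be dropping traffic in this namespace.
  kubectl get networkpolicy -n "$2"
}
```

An empty endpoints list with healthy-looking pods usually points at a label mismatch or a failing readiness probe rather than at the network.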
Forbidden / RBAC errors
Run kubectl auth can-i VERB RESOURCE -n NS to test the current user's access.
A ServiceAccount needs a RoleBinding. Cluster-scoped resources need a
ClusterRole and a ClusterRoleBinding.
A wrong namespace is often mistaken for an RBAC problem; confirm the active one with
kubectl config view --minify | grep namespace.
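To check permissions for a ServiceAccount rather than for yourself, impersonation with --as is one option (your own user needs impersonation rights for this). The sketch below uses hypothetical function names and binds the built-in view ClusterRole purely as an illustration.

```shell
# Sketch only: function names are hypothetical; "view" is just an example role.

# Ask whether a ServiceAccount may perform VERB on RESOURCE in its namespace.
can_sa() {  # usage: can_sa NS SERVICEACCOUNT VERB RESOURCE
  kubectl auth can-i "$3" "$4" -n "$1" --as="system:serviceaccount:$1:$2"
}

# Grant read-only access within one namespace via a RoleBinding.
grant_view() {  # usage: grant_view NS SERVICEACCOUNT
  kubectl create rolebinding "$2-view" -n "$1" \
    --clusterrole=view --serviceaccount="$1:$2"
}
```

A RoleBinding that references a ClusterRole still grants access only inside the RoleBinding's namespace.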
Node NotReady
The kubelet is not reporting a healthy status. SSH to the node (if allowed) and
check systemctl status kubelet, disk usage, and the container runtime
(systemctl status containerd). kubectl describe node lists the node
conditions: DiskPressure, MemoryPressure, PIDPressure. Pods on NotReady nodes
may be evicted after a timeout (five minutes with the default tolerations).
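The node checks split into one query from your workstation and one command on the node itself. A minimal sketch with hypothetical function names; it assumes the kubelet runs as a systemd unit named kubelet.

```shell
# Sketch only: function names are hypothetical; assumes a systemd-managed kubelet.

# From your workstation: every node condition with its status and message.
node_conditions() {  # usage: node_conditions NODE
  kubectl get node "$1" -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\t"}{.message}{"\n"}{end}'
}

# On the node (over SSH): the most recent kubelet log lines.
kubelet_recent_logs() {
  journalctl -u kubelet --since "15 min ago" --no-pager | tail -n 50
}
```

A Ready condition with status Unknown typically means the kubelet stopped posting status, so the node-side logs are the next place to look.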
API server slow or kubectl timeouts
On self-managed clusters, check etcd health (see the
etcd lab). High API server load, etcd disk latency,
or too many watches can cause timeouts. Check control-plane node resources, then
probe the API server with kubectl get --raw /healthz or with curl through
kubectl proxy (see debugging workflow below). Managed clusters:
check cloud control-plane status.
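The API server's health endpoints accept a verbose query parameter that lists each individual check, which helps narrow a slow or failing control plane down to one component such as etcd. A minimal sketch with a hypothetical function name:

```shell
# Sketch only: the function name is hypothetical.
apiserver_checks() {
  # Per-check readiness report (one line per check, including etcd).
  kubectl get --raw '/readyz?verbose'
  # Per-check liveness report.
  kubectl get --raw '/livez?verbose'
}
```

On managed clusters these endpoints may be restricted, in which case the provider's status page is the better signal.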
Helm release problems
Failed installs/upgrades leave resources half-applied. Use the
Helm lab for helm status,
helm history, and rollback. Kubernetes object debugging still
applies — kubectl get pods -n RELEASE_NS and describe failing pods.
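The Helm commands named above fit into a short triage routine. A minimal sketch with hypothetical function names; RELEASE, NS, and REVISION are placeholders, and REVISION should come from the history output.

```shell
# Sketch only: function names are hypothetical; arguments are placeholders.

# Current release state plus the list of past revisions.
helm_triage() {  # usage: helm_triage RELEASE NS
  helm status "$1" -n "$2"
  helm history "$1" -n "$2"
}

# Roll the release back to a known-good revision from the history.
helm_rollback_to() {  # usage: helm_rollback_to RELEASE NS REVISION
  helm rollback "$1" "$3" -n "$2"
}
```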
Debugging workflow
1. Scope and events
kubectl get pods -n NS -o wide
kubectl describe pod POD -n NS
kubectl get events -n NS --sort-by=.lastTimestamp | tail -20
2. Logs (kubectl and stern)
kubectl logs POD -n NS --previous
# With the stern krew plugin, tail all matching pods:
stern -l app=myapp -n NS --since 10m
3. kubectl exec: what the pod sees
Shell into a running container to inspect DNS, files, env, and connectivity from the pod network namespace — not from your laptop.
kubectl exec -it POD -n NS -- sh
kubectl exec POD -c CONTAINER -n NS -- cat /etc/resolv.conf
kubectl exec POD -n NS -- wget -qO- http://my-svc.my-ns.svc.cluster.local
4. Port-forward: debug connectivity locally
Tunnel a pod or Service port to localhost when Ingress, firewalls, or cluster networking make direct access hard. Test HTTP, DB clients, or debug endpoints from your machine as if the service were local.
kubectl port-forward pod/POD 8080:80 -n NS
kubectl port-forward svc/my-svc 5432:5432 -n NS
curl http://localhost:8080/health
5. Debug pods
Use an ephemeral toolbox pod or kubectl debug when the app image has no
shell or tools (distroless, minimal images). netshoot carries curl, dig,
tcpdump, and traceroute; busybox offers only basics such as wget, nslookup,
and ping.
# One-off debug pod in namespace
kubectl run netshoot --rm -it --image=nicolaka/netshoot -n NS -- bash
# Ephemeral debug container on existing pod (cluster must support it)
kubectl debug POD -n NS -it --image=busybox --target=CONTAINER
# Copy the pod and replace CONTAINER's command with a shell
# (for clusters without ephemeral containers; the image must contain sh)
kubectl debug POD -n NS -it --copy-to=debug-pod --container=CONTAINER -- sh
6. Hierarchy and cluster scan
kubectl tree deploy/myapp -n NS # krew tree
kubectl popeye -n NS # krew popeye, optional hygiene scan
7. curl API via kubectl proxy
Query the Kubernetes API with curl for raw JSON and health checks.
kubectl proxy handles authentication locally — easiest for
troubleshooting from your laptop without crafting bearer tokens.
# Let kubectl proxy handle auth — easiest for local troubleshooting
kubectl proxy --port=8001 &
curl http://localhost:8001/api/v1/nodes
curl http://localhost:8001/api/v1/namespaces/NS/pods
curl http://localhost:8001/healthz # API server health (deprecated in favor of /livez and /readyz)
curl http://localhost:8001/readyz # readiness
curl http://localhost:8001/livez # liveness
# Equivalent without proxy: kubectl get --raw /healthz
krew plugins worth installing
Install via kubectl krew install PLUGIN. These complement raw kubectl:
- stern — tail logs across pods (labels, regex, time window)
- ctx / ns — switch cluster context and namespace
- tree — visualize owner references (Deployment → ReplicaSet → Pod)
- view-secret — decode Secret data without manual base64
- neat — strip managedFields noise from YAML output
- popeye — scan cluster for misconfigurations and stale resources
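Installing the whole list in one go could look like the sketch below. It assumes krew itself is already installed and that each plugin is available in the krew index under the name listed above; the function name is hypothetical.

```shell
# Sketch only: the function name is hypothetical; assumes krew is already set up.
install_krew_plugins() {
  local plugin
  for plugin in stern ctx ns tree view-secret neat popeye; do
    # Keep going if one plugin fails, but report it on stderr.
    kubectl krew install "$plugin" || echo "failed: $plugin" >&2
  done
}
```

kubectl krew list afterwards shows which plugins are actually present.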