Kubernetes troubleshooting

Pod stuck in Pending

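The scheduler could not place the pod on any node. Run kubectl describe pod POD -n NS and read the Events section; a FailedScheduling event states the reason. Common causes: insufficient CPU or memory on the nodes (the cluster is full), no node matching the nodeSelector or affinity rules, an unbound PVC, or taints without matching tolerations. Check kubectl get nodes and kubectl describe node NODE to compare allocatable resources with what is already requested.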
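A minimal sketch of the first two checks, assuming a POSIX shell with kubectl on the PATH. The function names pending_pods and scheduling_events are hypothetical helpers, not kubectl subcommands.

```shell
# List pods in a namespace that the scheduler has not placed yet.
# Usage: pending_pods NS
pending_pods() {
  kubectl get pods -n "$1" --field-selector=status.phase=Pending
}

# Show only the FailedScheduling events recorded for one pod.
# Usage: scheduling_events POD NS
scheduling_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling"
}
```

The event message usually names the blocking constraint (resources, taints, affinity, or volume binding), which narrows down which of the causes above applies.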

ImagePullBackOff / ErrImagePull

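The kubelet cannot pull the container image. Typical causes: wrong image name or tag, private registry without imagePullSecrets, registry down, or registry rate limits. Verify the image name in the pod spec, then test the pull from a node or from a debug pod. For private registries, create a docker-registry Secret and reference it in the pod spec (imagePullSecrets) or in the ServiceAccount the pod runs as.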
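One way to wire up a private registry, sketched under assumptions: the Secret name regcred and the variables REGISTRY, REG_USER, and REG_PASS are placeholders for illustration, and the helper names are hypothetical.

```shell
# Create a registry credential Secret named regcred (hypothetical name).
# REGISTRY, REG_USER and REG_PASS are placeholders you must set yourself.
# Usage: create_pull_secret NS
create_pull_secret() {
  kubectl create secret docker-registry regcred -n "$1" \
    --docker-server="$REGISTRY" \
    --docker-username="$REG_USER" \
    --docker-password="$REG_PASS"
}

# Reference the Secret from a ServiceAccount so pods using it can pull.
# Usage: attach_pull_secret NS SERVICEACCOUNT
attach_pull_secret() {
  kubectl patch serviceaccount "$2" -n "$1" \
    -p '{"imagePullSecrets":[{"name":"regcred"}]}'
}
```

Pods pick up the ServiceAccount's imagePullSecrets when they are created, so existing pods need to be recreated after the patch.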

CrashLoopBackOff

The container starts, exits (usually non-zero), and the kubelet restarts it with an increasing backoff delay. Check kubectl logs POD -n NS --previous for output from the crashed instance. Common causes: app misconfiguration, a missing env variable or Secret, wrong command, port already in use inside the container, failed migrations. Use kubectl describe pod POD -n NS for the restart count and exit codes.
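Exit codes can also be read directly with a jsonpath query. The sketch below assumes a POSIX shell; last_exit_codes and explain_exit_code are hypothetical helpers, and the code-to-meaning table covers only common shell and signal conventions, not every application.

```shell
# Print "container<TAB>exit code" for the last terminated instance of each container.
# Usage: last_exit_codes POD NS
last_exit_codes() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Map a few common exit codes to a likely meaning (rough guide only).
# Usage: explain_exit_code CODE
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly (restarted because of restartPolicy)" ;;
    1)   echo "application error" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "SIGKILL (128+9), often OOMKilled" ;;
    143) echo "SIGTERM (128+15)" ;;
    *)   echo "see application logs" ;;
  esac
}
```

Codes above 128 generally mean the process was ended by a signal (code minus 128), so they point at the kubelet, the kernel, or a probe rather than at application logic.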

OOMKilled

The container exceeded its memory limit and was killed. kubectl describe pod shows Last State: Terminated, Reason: OOMKilled (exit code 137). Raise resources.limits.memory or fix the leak. Memory pressure at node level can also evict pods; check kubectl get events and the node's MemoryPressure condition.
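To confirm the diagnosis without scanning the full describe output, the termination reason and the configured limit can be printed side by side. A sketch with hypothetical helper names, assuming a POSIX shell:

```shell
# Print each container's last termination reason (OOMKilled appears here).
# Usage: last_termination_reasons POD NS
last_termination_reasons() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Print each container's configured memory limit for comparison.
# An empty second column means no limit is set on that container.
# Usage: memory_limits POD NS
memory_limits() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'
}
```

If metrics-server is installed (an assumption about the cluster), kubectl top pod POD -n NS --containers shows current usage to compare against the limit.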

Service not reachable

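Trace the path step by step. Does the Service selector match the pod labels? kubectl get endpoints SERVICE -n NS should list pod IPs; an empty list means no ready pod matches the selector. Test from another pod: kubectl run tmp --rm -it --restart=Never --image=busybox -- wget -qO- http://SERVICE.NS.svc (append :PORT if the Service does not listen on 80). For Ingress issues, see the Traefik or Nginx labs. Also check for NetworkPolicies blocking traffic.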
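The selector check can be scripted by turning the Service's selector into a label query and listing the pods it matches. This is a sketch with hypothetical helper names; it assumes a POSIX shell and a Service that defines a selector.

```shell
# Print the Service selector as a label query, for example app=myapp,tier=web.
# Usage: svc_selector SERVICE NS
svc_selector() {
  sel=$(kubectl get svc "$1" -n "$2" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')
  printf '%s\n' "${sel%,}"   # drop the trailing comma left by the template
}

# List the pods that the Service selector actually matches.
# Note: an empty selector would match every pod in the namespace.
# Usage: svc_pods SERVICE NS
svc_pods() {
  kubectl get pods -n "$2" -l "$(svc_selector "$1" "$2")" -o wide
}
```

If svc_pods lists no pods, or only pods that are not Ready, the endpoints list stays empty and the Service has nowhere to send traffic.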

Forbidden / RBAC errors

Run kubectl auth can-i VERB RESOURCE -n NS to check what the current user may do. A ServiceAccount needs a RoleBinding that grants it a Role. Cluster-scoped resources need a ClusterRole bound with a ClusterRoleBinding. A wrong namespace is often mistaken for an RBAC problem; confirm with kubectl config view --minify | grep namespace.
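The same check works for a ServiceAccount through impersonation, provided your own user is allowed to impersonate. A sketch with hypothetical helper names; the binding name SA-ROLE is just an illustrative convention.

```shell
# Check a permission as a ServiceAccount via impersonation.
# Usage: sa_can_i NS SA VERB RESOURCE
sa_can_i() {
  kubectl auth can-i "$3" "$4" -n "$1" \
    --as="system:serviceaccount:$1:$2"
}

# Grant an existing namespaced Role to a ServiceAccount.
# The binding is named SA-ROLE (hypothetical naming convention).
# Usage: bind_role NS SA ROLE
bind_role() {
  kubectl create rolebinding "$2-$3" -n "$1" \
    --role="$3" --serviceaccount="$1:$2"
}
```

For example, sa_can_i demo builder list pods prints yes or no for the builder ServiceAccount in the demo namespace.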

Node NotReady

The kubelet is not reporting healthy. SSH to the node (if allowed) and check systemctl status kubelet, disk usage, and the container runtime (systemctl status containerd). Node conditions appear in kubectl describe node NODE: DiskPressure, MemoryPressure, PIDPressure. Pods on a NotReady node are evicted once their not-ready toleration expires (typically 300 seconds by default).
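Node conditions can be reduced to just the unhealthy ones. The sketch below assumes a POSIX shell with awk; node_conditions and unhealthy_conditions are hypothetical helpers.

```shell
# Print "Type=Status" for every condition reported on a node.
# Usage: node_conditions NODE
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Filter stdin down to unhealthy conditions:
# Ready should be True, every other condition should be False.
# Usage: node_conditions NODE | unhealthy_conditions
unhealthy_conditions() {
  awk -F= 'NF && (($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 != "False"))'
}
```

A status of Unknown counts as unhealthy here, which matches the case where the kubelet has stopped reporting altogether.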

API server slow or kubectl timeouts

On self-managed clusters, check etcd health first (see the etcd lab). High API server load, etcd disk latency, or too many watches can cause timeouts. Check control-plane node resources, then kubectl get --raw /healthz or curl via kubectl proxy (see debugging workflow below). On managed clusters, check the cloud provider's control-plane status.
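The verbose form of the health endpoints lists each check on its own line, which helps locate the failing component. A sketch assuming kubectl proxy is already listening on localhost (port 8001 unless overridden) and that curl is installed; both helper names are hypothetical.

```shell
# Keep only failing checks from a verbose health report read on stdin.
# Passing checks are prefixed "[+]", failing ones "[-]".
failed_checks() {
  grep -F '[-]' || true   # grep exits 1 when nothing matches; treat that as "no failures"
}

# Fetch the verbose readiness report through kubectl proxy and show failures.
# Usage: readyz_failures [PROXY_PORT]
readyz_failures() {
  curl -s "http://localhost:${1:-8001}/readyz?verbose" | failed_checks
}
```

Going through curl rather than kubectl get --raw keeps the response body readable even when the endpoint answers with an error status.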

Helm release problems

Failed installs or upgrades can leave resources half-applied. Use the Helm lab for helm status, helm history, and rollback. Kubernetes object debugging still applies: run kubectl get pods -n RELEASE_NS and describe the failing pods.
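As a quick reference, the three Helm commands mentioned above can be wrapped as follows. This is a sketch assuming Helm 3 and a POSIX shell; the helper names are hypothetical.

```shell
# Show the current state and the most recent revisions of a release.
# Usage: release_overview RELEASE NS
release_overview() {
  helm status "$1" -n "$2"
  helm history "$1" -n "$2" --max 5
}

# Roll back to a revision number taken from the history output.
# Usage: release_rollback RELEASE NS REVISION
release_rollback() {
  helm rollback "$1" "$3" -n "$2"
}
```

Pick the revision from the last history entry whose status is deployed or superseded rather than failed.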

Debugging workflow

1. Scope and events

kubectl get pods -n NS -o wide
kubectl describe pod POD -n NS
kubectl get events -n NS --sort-by=.lastTimestamp | tail -20

2. Logs (kubectl and stern)

kubectl logs POD -n NS --previous
# With stern (run as kubectl stern when installed through krew): logs from all matching pods
stern -l app=myapp -n NS --since 10m

3. kubectl exec — what the pod sees

Shell into a running container to inspect DNS, files, env, and connectivity from the pod network namespace — not from your laptop.

kubectl exec -it POD -n NS -- sh
kubectl exec POD -c CONTAINER -n NS -- cat /etc/resolv.conf
kubectl exec POD -n NS -- wget -qO- http://my-svc.my-ns.svc.cluster.local

4. Port-forward — debug connectivity locally

Tunnel a pod or Service port to localhost when Ingress, firewalls, or cluster networking make direct access hard. Test HTTP, DB clients, or debug endpoints from your machine as if the service were local.

kubectl port-forward pod/POD 8080:80 -n NS
kubectl port-forward svc/my-svc 5432:5432 -n NS
curl http://localhost:8080/health

5. Debug pods

Use ephemeral toolbox pods or kubectl debug when the app image has no shell or tools (distroless, minimal images). netshoot carries curl, dig, tcpdump, and traceroute; busybox is much smaller and offers wget, nslookup, ping, and traceroute.

# One-off debug pod in namespace
kubectl run netshoot --rm -it --image=nicolaka/netshoot -n NS -- bash

# Ephemeral debug container on existing pod (cluster must support it)
kubectl debug POD -n NS -it --image=busybox --target=CONTAINER

# Copy the pod and override the container's command with a shell (works without ephemeral containers, e.g. on older clusters)
kubectl debug POD -n NS -it --copy-to=debug-pod --container=CONTAINER -- sh

6. Hierarchy and cluster scan

kubectl tree deploy/myapp -n NS    # krew tree
kubectl popeye -n NS                 # krew popeye — optional hygiene scan

7. curl API via kubectl proxy

Query the Kubernetes API with curl for raw JSON and health checks. kubectl proxy handles authentication locally — easiest for troubleshooting from your laptop without crafting bearer tokens.

# Let kubectl proxy handle auth — easiest for local troubleshooting
kubectl proxy --port=8001 &
curl http://localhost:8001/api/v1/nodes
curl http://localhost:8001/api/v1/namespaces/NS/pods

curl http://localhost:8001/healthz    # API server health (legacy endpoint)
curl http://localhost:8001/readyz     # readiness
curl http://localhost:8001/livez      # liveness

# Equivalent without proxy: kubectl get --raw /healthz

krew plugins worth installing

Install via kubectl krew install PLUGIN. These complement raw kubectl:

stern — tail logs across pods (labels, regex, time window)
ctx / ns — switch cluster context and namespace
tree — visualize owner references (Deployment → ReplicaSet → Pod)
view-secret — decode Secret data without manual base64
neat — strip managedFields noise from YAML output
popeye — scan cluster for misconfigurations and stale resources

