Last quarter we burned three hours hunting a CrashLoopBackOff that came down to a single missing env var. The mental model below would have surfaced it in 30 seconds — and it's the same model every senior SRE keeps in their head when an alert fires at 2am.
Who this is for
- Junior SREs running their first on-call rotation
- Backend engineers shipping their first production K8s workload
- Platform engineers onboarding new teammates to incident triage
The big idea
A pod is the smallest deployable unit in Kubernetes, and its life is a loop: the scheduler places it, the kubelet starts it, probes decide whether it gets traffic, and the restart policy decides what happens when it dies. Every state you see in kubectl get pods — Pending, ContainerCreating, Running, CrashLoopBackOff — is just a snapshot of where the pod is on that loop. Master the loop and the failure modes start to read themselves.
How it actually starts
1. Pending — the scheduler is shopping
When you kubectl apply a Deployment, the API server writes the pod spec to etcd and the scheduler wakes up. Its job is to pick a node that satisfies the pod's resource requests, node selectors, taints, and affinity rules.
A pod sits in Pending for one of three reasons:
- No node has enough free CPU/memory to fit the request.
- A nodeSelector or tolerations rule excludes every available node.
- A PVC the pod needs hasn't been bound yet.
kubectl describe pod my-pod | tail -20
The Events section at the bottom tells you exactly which constraint failed. Don't guess — read the events.
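All three constraints live in the pod spec itself. A minimal sketch of where they sit (the names, values, and the PVC are illustrative, not from a real workload):

spec:
  nodeSelector:
    disktype: ssd            # excludes every node without this label
  tolerations:
  - key: dedicated
    operator: Equal
    value: batch
    effect: NoSchedule
  containers:
  - name: app
    image: my-registry.example.com/app:1.4.2
    resources:
      requests:
        cpu: 500m            # the scheduler needs a node with this much free CPU...
        memory: 512Mi        # ...and this much free memory
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: app-data    # if this PVC never binds, the pod stays Pending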
2. ContainerCreating — the kubelet takes over
Once a node is chosen, the kubelet on that node pulls the container image, sets up the network namespace via the CNI plugin, and mounts any volumes. This is the phase where image-pull errors and volume-mount errors surface.
kubectl get events --field-selector involvedObject.name=my-pod --sort-by=.lastTimestamp
If you're stuck in ContainerCreating for more than 30 seconds, 90% of the time it's either a private registry without an imagePullSecret or a PVC stuck in Pending because the storage class can't provision a volume.
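If the events point at storage rather than the image, check whether the claim ever bound. A quick sweep, with the claim name as a placeholder:

kubectl get pvc app-data                    # STATUS should say Bound, not Pending
kubectl describe pvc app-data | tail -10    # provisioning failures show up in these Events too
kubectl get storageclass                    # confirm the class the claim asks for actually exists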
3. Running — but is it really ready?
Running only means the container process started. It does not mean the application inside is ready to serve traffic. That's what probes are for.
- Liveness probe — restarts the container if it stops responding. Use this for deadlock detection.
- Readiness probe — removes the pod from the Service's endpoints if it's temporarily unhealthy. Use this for slow-starting apps.
- Startup probe — gates the other two until the app finishes booting. Use this for legacy JVM apps that take 90 seconds to warm up.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
A pod with 1/1 Running and a failing readiness probe is invisible to its Service — and that's the silent failure mode that costs the most debugging time.
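For the 90-second JVM case above, the startup probe is the one doing the gating. A sketch with illustrative thresholds (tune failureThreshold times periodSeconds to your app's real boot time):

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 12       # allows up to 120s of booting before liveness/readiness kick in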
4. CrashLoopBackOff — the kubelet backs off, it doesn't give up
When a container exits non-zero, the kubelet restarts it. After repeated failures, it backs off exponentially: 10s, 20s, 40s, 80s, capped at 5 minutes. That's the CrashLoopBackOff you see in kubectl get pods.
The single most useful command in your career:
kubectl logs my-pod --previous
The --previous flag fetches logs from the last crashed container, not the one currently coming up. Without it you often see an empty log because the new container hasn't printed anything yet. Pair it with kubectl describe pod and the cause is almost always staring back at you within 30 seconds.
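If even --previous comes back empty (the process died before it printed anything), the pod's status still records what happened. A sketch, assuming a single-container pod named my-pod:

# exit code and reason from the container that just crashed
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'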
Common failures
ImagePullBackOff
Diagnose — kubectl describe pod my-pod | grep -A3 'Failed'
Fix — The kubelet tried to pull the image and the registry refused it. Three causes account for almost every case:
- Typo in the image tag. nginx:lastest is not nginx:latest.
- Private registry, no imagePullSecret. Create the secret with kubectl create secret docker-registry, then reference it in the pod spec's imagePullSecrets (sketched below).
- Rate-limited by Docker Hub. Anonymous pulls are capped at 100/6h per IP. Mirror to ECR/GCR or authenticate.
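A sketch of the private-registry fix end to end; the registry URL, secret name, and credentials are placeholders:

kubectl create secret docker-registry regcred \
  --docker-server=my-registry.example.com \
  --docker-username=ci-bot \
  --docker-password='REDACTED'

Then reference it from the pod spec:

spec:
  imagePullSecrets:
  - name: regcred
  containers:
  - name: app
    image: my-registry.example.com/app:1.4.2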
OOMKilled
Diagnose — kubectl describe pod my-pod | grep -E 'Reason|Exit Code'
Fix — The container exceeded its memory limit and the Linux kernel's OOM killer terminated it. You'll see Reason: OOMKilled and exit code 137.
The fix isn't always 'raise the limit' — sometimes the app has a memory leak or an unbounded cache. Check the trend with kubectl top pod my-pod --containers first. If memory grows linearly with traffic and never drops, you have a leak. If it spikes during requests and recovers, you need a higher limit or a request-concurrency cap.
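If the verdict is a higher limit, this is the knob; the numbers are illustrative, not a recommendation:

resources:
  requests:
    memory: 256Mi    # what the scheduler reserves on the node
  limits:
    memory: 512Mi    # the ceiling; exceed it and the kernel OOM-kills the container (exit code 137)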
What to read next
- Kubernetes Probes — Liveness, Readiness, Startup
- Resource Requests vs Limits — The Tax You Pay for Stability
- Official: Pod Lifecycle (kubernetes.io)