Last quarter we burned three hours hunting a CrashLoopBackOff that came down to a single missing env var. The mental model below would have surfaced it in 30 seconds — and it's the same model every senior SRE keeps in their head when an alert fires at 2am.
Who this is for
- Junior SREs running their first on-call rotation
- Backend engineers shipping their first production K8s workload
- Platform engineers onboarding new teammates to incident triage
The big idea
A pod is the smallest deployable unit in Kubernetes, and its life is a loop: the scheduler places it, the kubelet starts it, probes decide whether it gets traffic, and the restart policy decides what happens when it dies. Every state you see in kubectl get pods — Pending, ContainerCreating, Running, CrashLoopBackOff — is just a snapshot of where the pod is on that loop. Master the loop and the failure modes start to read themselves.
How it actually starts
1. Pending — the scheduler is shopping
When you kubectl apply a Deployment, the API server writes the pod spec to etcd and the scheduler wakes up. Its job is to pick a node that satisfies the pod's resource requests, node selectors, taints, and affinity rules.
A pod sits in Pending for one of three reasons:
- No node has enough free CPU/memory to fit the request.
- A nodeSelector or tolerations rule excludes every available node.
- A PVC the pod needs hasn't been bound yet.
kubectl describe pod my-pod | tail -20
The Events section at the bottom tells you exactly which constraint failed. Don't guess — read the events.
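All three constraints live in the pod spec itself. A minimal sketch of where they sit (the names, values, and the PVC are illustrative, not from a real workload):

spec:
  nodeSelector:
    disktype: ssd            # excludes every node without this label
  tolerations:
  - key: dedicated
    operator: Equal
    value: batch
    effect: NoSchedule
  containers:
  - name: app
    image: my-registry.example.com/app:1.4.2
    resources:
      requests:
        cpu: 500m            # the scheduler needs a node with this much free CPU...
        memory: 512Mi        # ...and this much free memory
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: app-data    # if this PVC never binds, the pod stays Pending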
2. ContainerCreating — the kubelet takes over
Once a node is chosen, the kubelet on that node pulls the container image, sets up the network namespace via the CNI plugin, and mounts any volumes. This is the phase where image-pull errors and volume-mount errors surface.
kubectl get events --field-selector involvedObject.name=my-pod --sort-by=.lastTimestamp
If you're stuck in ContainerCreating for more than 30 seconds, 90% of the time it's either a private registry without an imagePullSecret or a PVC stuck in Pending because the storage class can't provision a volume.
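If the events point at storage rather than the image, check whether the claim ever bound. A quick sweep, with the claim name as a placeholder:

kubectl get pvc app-data                    # STATUS should say Bound, not Pending
kubectl describe pvc app-data | tail -10    # provisioning failures show up in these Events too
kubectl get storageclass                    # confirm the class the claim asks for actually exists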
3. Running — but is it really ready?
Running only means the container process started. It does not mean the application inside is ready to serve traffic. That's what probes are for.
- Liveness probe — restarts the container if it stops responding. Use this for deadlock detection.
- Readiness probe — removes the pod from the Service's endpoints if it's temporarily unhealthy. Use this for slow-starting apps.
- Startup probe — gates the other two until the app finishes booting. Use this for legacy JVM apps that take 90 seconds to warm up.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
A pod with 1/1 Running and a failing readiness probe is invisible to its Service — and that's the silent failure mode that costs the most debugging time.
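For the 90-second JVM case above, the startup probe is the one doing the gating. A sketch with illustrative thresholds (tune failureThreshold times periodSeconds to your app's real boot time):

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 12       # allows up to 120s of booting before liveness/readiness kick in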
4. CrashLoopBackOff — the kubelet backs off, it doesn't give up
When a container exits non-zero, the kubelet restarts it. After repeated failures, it backs off exponentially: 10s, 20s, 40s, 80s, capped at 5 minutes. That's the CrashLoopBackOff you see in kubectl get pods.
The single most useful command in your career:
kubectl logs my-pod --previous
The --previous flag fetches logs from the last crashed container, not the one currently coming up. Without it you often see an empty log because the new container hasn't printed anything yet. Pair it with kubectl describe pod and the cause is almost always staring back at you within 30 seconds.
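If even --previous comes back empty (the process died before it printed anything), the pod's status still records what happened. A sketch, assuming a single-container pod named my-pod:

# exit code and reason from the container that just crashed
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'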
Common failures
ImagePullBackOff
Diagnose — kubectl describe pod my-pod | grep -A3 'Failed'
Fix — The kubelet tried to pull the image and the registry refused it. Three causes account for almost every case:
- Typo in the image tag. nginx:lastest is not nginx:latest.
- Private registry, no imagePullSecret. Create the secret with kubectl create secret docker-registry, then reference it in the pod spec's imagePullSecrets (sketched below).
- Rate-limited by Docker Hub. Anonymous pulls are capped at 100/6h per IP. Mirror to ECR/GCR or authenticate.
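A sketch of the private-registry fix end to end; the registry URL, secret name, and credentials are placeholders:

kubectl create secret docker-registry regcred \
  --docker-server=my-registry.example.com \
  --docker-username=ci-bot \
  --docker-password='REDACTED'

Then reference it from the pod spec:

spec:
  imagePullSecrets:
  - name: regcred
  containers:
  - name: app
    image: my-registry.example.com/app:1.4.2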
OOMKilled
Diagnose — kubectl describe pod my-pod | grep -E 'Reason|Exit Code'
Fix — The container exceeded its memory limit and the Linux kernel's OOM killer terminated it. You'll see Reason: OOMKilled and exit code 137.
The fix isn't always 'raise the limit' — sometimes the app has a memory leak or an unbounded cache. Check the trend with kubectl top pod my-pod --containers first. If memory grows linearly with traffic and never drops, you have a leak. If it spikes during requests and recovers, you need a higher limit or a request-concurrency cap.
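If the verdict is a higher limit, this is the knob; the numbers are illustrative, not a recommendation:

resources:
  requests:
    memory: 256Mi    # what the scheduler reserves on the node
  limits:
    memory: 512Mi    # the ceiling; exceed it and the kernel OOM-kills the container (exit code 137)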
What to read next
- Kubernetes Probes — Liveness, Readiness, Startup
- Resource Requests vs Limits — The Tax You Pay for Stability
- Official: Pod Lifecycle (kubernetes.io)