Kubernetes Troubleshooting

Kubernetes problems concentrate in a few predictable areas: pods stuck in Pending or CrashLoopBackOff, nodes going NotReady, Services not routing traffic, and PVCs stuck in Pending. Kubernetes records detailed events and status conditions that usually point to the cause, if you know where to look. This guide covers the diagnostic workflow for the most common failure modes in production clusters.

Pods not starting

Status	Cause	Fix
Pending	No node has enough CPU/memory	Add nodes or reduce resource requests
Pending	PVC not bound	Create matching PV or fix StorageClass
Pending	Taint/toleration mismatch	Add toleration to pod spec
ImagePullBackOff	Wrong image name or registry auth	Fix image tag or create imagePullSecret
CrashLoopBackOff	Container exits after starting	Check logs for application error
OOMKilled	Container exceeded memory limit	Increase memory limit

# First step for ANY pod problem:
kubectl describe pod <pod-name>    # Read the Events section at the bottom

kubectl describe pod events section

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  45s   default-scheduler  0/2 nodes are available:
                                                      2 Insufficient memory.
# Clear message: no node has enough unreserved memory to satisfy the pod's memory request
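
The scheduler places pods by comparing their resource requests with what each node has left unreserved, not with live usage. A minimal sketch for inspecting that, assuming a POSIX shell and kubectl access to the cluster. The helper name node_allocations is hypothetical, and the section markers it relies on ("Allocated resources:" through "Events:") reflect typical kubectl describe node output, which may differ between versions:

```shell
# Hypothetical helper: print the "Allocated resources" section of a node,
# which sums the requests and limits of the pods already placed on it.
node_allocations() {
  kubectl describe node "$1" | sed -n '/^Allocated resources:/,/^Events:/p'
}

# Usage (node name is a placeholder):
#   node_allocations <node-name>
```

If the memory Requests column is close to 100%, new pods with memory requests will stay Pending on that node even when actual usage is low.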

# Useful diagnostic commands:
kubectl get pods -o wide    # Shows which node each pod runs on
kubectl get events --sort-by='.lastTimestamp' --field-selector type=Warning
kubectl top nodes            # Check if nodes are resource-saturated (requires metrics-server)
kubectl describe nodes | grep -A5 "Conditions:"
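
On a busy namespace the pod list is long, so it can help to filter it down to the pods that need attention. A small sketch, assuming a POSIX shell with awk. The function name unhealthy_pods is hypothetical, and it assumes STATUS is the third column, which holds for kubectl get pods without --all-namespaces:

```shell
# Hypothetical filter: keep the header row plus any pod whose STATUS
# (third column of plain `kubectl get pods` output) is neither Running
# nor Completed. With --all-namespaces the columns shift by one.
unhealthy_pods() {
  awk 'NR == 1 || ($3 != "Running" && $3 != "Completed")'
}

# Usage:
#   kubectl get pods -o wide | unhealthy_pods
```

A pod can be Running and still not ready, so the READY column is worth a separate look.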

CrashLoopBackOff

# CrashLoopBackOff = container keeps crashing, Kubernetes backing off restarts
# Diagnose with logs:
kubectl logs <pod-name>              # Current container logs
kubectl logs <pod-name> --previous   # Logs from the LAST crashed container

# Often the previous logs are more useful — they show what happened before the crash
# Example output:
# Error: MYSQL_ROOT_PASSWORD not set. Exiting.
# → Missing environment variable in pod spec
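
The two log commands above can be combined so that the previous container's logs are tried first. A minimal sketch, assuming a POSIX shell and kubectl access. The helper name crash_logs and the 50-line tail are illustrative choices, not part of the original workflow:

```shell
# Hypothetical helper: prefer logs from the previous (crashed) container
# instance; if none exists yet, fall back to the current container.
crash_logs() {
  kubectl logs "$1" --previous --tail=50 2>/dev/null ||
    kubectl logs "$1" --tail=50
}

# Usage (pod name is a placeholder):
#   crash_logs <pod-name>
```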

# Get a shell in a crashing container (if it starts up briefly):
kubectl exec -it <pod-name> -- /bin/sh

# Run an equivalent image with a shell to debug:
kubectl run debug --rm -it --image=mysql:8.0 --env="MYSQL_ROOT_PASSWORD=test" -- bash

# Check OOM kills:
kubectl describe pod <pod-name> | grep -i "OOMKilled\|Exit Code"
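
The exit code reported by kubectl describe pod narrows down the cause quickly. A rough sketch of the common interpretations, assuming a POSIX shell. The function name explain_exit_code is hypothetical, and the mapping follows general Unix conventions (codes above 128 mean the process was killed by signal number code minus 128) rather than anything specific to one application:

```shell
# Hypothetical helper: map a container exit code to a likely cause.
# Codes above 128 conventionally mean "killed by signal (code - 128)".
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check why the main process ends" ;;
    1)   echo "application error; read the previous container logs" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found; check the image entrypoint or command" ;;
    137) echo "SIGKILL; often OOMKilled, confirm with kubectl describe pod" ;;
    143) echo "SIGTERM; the container was asked to stop" ;;
    *)   echo "unrecognized; consult the application's documentation" ;;
  esac
}

# Usage (pod name is a placeholder; reads the first container's last exit code):
#   code=$(kubectl get pod <pod-name> -o \
#     jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   explain_exit_code "$code"
```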

Node issues

# Check node health:
kubectl get nodes    # Look for NotReady status
kubectl describe node <node-name>    # Read Conditions section

# Common conditions and meaning:
# Ready=False/Unknown → kubelet is unhealthy, or the node is unreachable
# MemoryPressure=True → Node running low on memory (pods may be evicted)
# DiskPressure=True   → Node running low on disk space (check with df -h on the node)
# PIDPressure=True    → Node running too many processes

# If DiskPressure is True on a node (very common):
# SSH into the node and find what is filling disk:
df -h
du -sh /var/lib/docker/    # Docker images and volumes (on containerd nodes, check /var/lib/containerd/)
du -sh /var/log/pods/      # Pod log files
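
To spot the full filesystem without reading the whole df table, the output can be filtered by usage. A minimal sketch, assuming a POSIX shell with awk and the portable df -P layout (capacity in column 5, mount point in column 6). The function name fs_over_threshold is hypothetical, and the 85 percent default is an arbitrary warning level chosen for illustration, not a kubelet setting:

```shell
# Hypothetical filter: read `df -P` output on stdin and print every mount
# point whose usage is at or above the given percentage (default 85).
fs_over_threshold() {
  awk -v limit="${1:-85}" 'NR > 1 {
    used = $5; sub(/%/, "", used)
    if (used + 0 >= limit + 0) print $6, $5
  }'
}

# Usage on the node:
#   df -P | fs_over_threshold 85
```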

kubectl describe node Conditions section

Conditions:
  Type             Status  Message
  ----             ------  -------
  MemoryPressure   False   kubelet has sufficient memory available
  DiskPressure     True    kubelet has disk pressure    ← PROBLEM
  PIDPressure      False   kubelet is functioning properly
  Ready            False   Kubelet stopped posting node status.
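
In a cluster with many nodes, it is quicker to list only the nodes that are not Ready and then describe those. A small sketch, assuming a POSIX shell with awk and the default kubectl get nodes columns (NAME first, STATUS second). The function name not_ready_nodes is hypothetical:

```shell
# Hypothetical filter: print the name and status of every node whose
# STATUS does not start with "Ready". Cordoned but healthy nodes show
# "Ready,SchedulingDisabled" and are therefore not reported.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1, $2 }'
}

# Usage:
#   kubectl get nodes | not_ready_nodes | while read -r node _; do
#     kubectl describe node "$node" | grep -A 8 'Conditions:'
#   done
```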

Networking problems

# Symptom: pods cannot reach a service
# Step 1: verify the service exists and has endpoints:
kubectl get service my-service
kubectl get endpoints my-service

kubectl get endpoints output — healthy vs broken

# Healthy (has pod IPs):
NAME         ENDPOINTS                         AGE
my-service   10.244.1.15:8080,10.244.1.23:8080   5m

# Broken (no endpoints = no Ready pods match the Service selector):
NAME         ENDPOINTS   AGE
my-service   <none>      5m
# Fix: check that pod labels match service selector:
kubectl get pods --show-labels
kubectl describe service my-service | grep Selector
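
Comparing the selector with pod labels by eye is error-prone when the label list is long. A minimal sketch of the check, assuming a POSIX shell and that both values are comma-separated key=value lists, as printed by the Selector line of kubectl describe service and the LABELS column of kubectl get pods --show-labels. The function name labels_match and the example labels are hypothetical:

```shell
# Hypothetical helper: succeed only if every key=value pair in the
# selector (first argument) also appears in the pod's label list
# (second argument). Both are comma-separated lists.
labels_match() {
  selector=$1
  labels=$2
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;                      # this pair is present
      *) IFS=$old_ifs; return 1 ;;         # missing pair: no match
    esac
  done
  IFS=$old_ifs
  return 0
}

# Usage (labels are illustrative):
#   labels_match "app=web" "app=web,pod-template-hash=5d8f7" && echo match
```

A pod needs every pair in the selector, so one mistyped label value is enough to leave the endpoints list empty.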

# Test DNS resolution from inside a pod:
kubectl run dns-test --rm -it --image=alpine -- nslookup my-service
kubectl run dns-test --rm -it --image=alpine -- wget -qO- http://my-service:8080
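
The short name my-service only resolves from pods in the same namespace as the Service. From another namespace, the namespace-qualified name is needed. A small sketch that builds it, assuming a POSIX shell and the common cluster domain cluster.local, which may be different in a given cluster. The function name service_fqdn is hypothetical:

```shell
# Hypothetical helper: build the fully qualified DNS name of a Service.
# Arguments: service name, namespace (default: default),
# cluster domain (default: cluster.local, an assumption).
service_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "${2:-default}" "${3:-cluster.local}"
}

# Usage:
#   kubectl run dns-test --rm -it --image=alpine -- \
#     nslookup "$(service_fqdn my-service default)"
```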

# Check if CNI plugin is working:
kubectl get pods -n kube-system    # Flannel/Calico pods should all be Running

Conclusion

The Kubernetes troubleshooting toolkit: kubectl describe for events and conditions, kubectl logs --previous for crash diagnostics, kubectl get endpoints for Service routing problems, and kubectl get events --sort-by='.lastTimestamp' for a chronological view of what happened in the cluster. DiskPressure on nodes is one of the most common production problems, so monitor node disk usage and configure log rotation for pod logs (/var/log/pods/) to keep them from silently filling the disk.

FAQ

Is Kubernetes Troubleshooting important for Ubuntu administrators?

Yes. It supports practical Ubuntu administration because it connects directly to server reliability, security, troubleshooting, or daily operations.

Should I practice this on a live server?

Use a lab VM first. After you understand the command output and rollback path, apply the workflow carefully on real systems.

What should I do after reading this article?

Run the practice commands, write down what each one shows, and continue to the next article in the Ubuntu roadmap.
