Kubernetes Troubleshooting
Kubernetes problems concentrate in a few predictable areas: pods stuck in Pending or CrashLoopBackOff, nodes going NotReady, Services not routing traffic, and PVCs stuck in Pending. The good news is that Kubernetes records detailed events and status conditions that usually tell you what is wrong, provided you know where to look. This guide covers the diagnostic workflow for the most common failure modes in production clusters.
Pods not starting
| Status | Cause | Fix |
|---|---|---|
| Pending | No node has enough CPU/memory | Add nodes or reduce resource requests |
| Pending | PVC not bound | Create matching PV or fix StorageClass |
| Pending | Taint/toleration mismatch | Add toleration to pod spec |
| ImagePullBackOff | Wrong image name or registry auth | Fix image tag or create imagePullSecret |
| CrashLoopBackOff | Container exits after starting | Check logs for application error |
| OOMKilled | Container exceeded memory limit | Increase memory limit or reduce the application's memory use |
# First step for ANY pod problem:
kubectl describe pod <pod-name> # Read the Events section at the bottom
kubectl describe pod events section
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 45s default-scheduler 0/2 nodes are available: 2 Insufficient memory.
# Clear message: need more memory available on nodes
# Useful diagnostic commands:
kubectl get pods -o wide # Shows which node each pod runs on
kubectl get events --sort-by='.lastTimestamp' --field-selector type=Warning
kubectl top nodes # Check if nodes are resource-saturated
kubectl describe nodes | grep -A8 "Conditions:" # -A8 leaves room for the header rows plus every condition, including Ready
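In a busy namespace it helps to narrow the pod list to the ones that need attention before running kubectl describe on each. The sketch below is one way to do that, not a standard tool: unhealthy_pods is a hypothetical helper name, and it assumes the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, ...). It looks only at the STATUS column, so a pod that is Running but not Ready is not flagged.

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints
# "<pod> <status>" for every pod whose STATUS is not Running or Completed.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk 'NF >= 3 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Typical use against a live cluster (needs kubectl and cluster access):
#   kubectl get pods --no-headers | unhealthy_pods
```

Each name it prints is a candidate for kubectl describe pod, where the Events section explains the status.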
CrashLoopBackOff
# CrashLoopBackOff = container keeps crashing, Kubernetes backing off restarts
# Diagnose with logs:
kubectl logs <pod-name> # Current container logs
kubectl logs <pod-name> --previous # Logs from the LAST crashed container
# The previous logs are often more useful because they show what happened right before the crash
# Example output:
# Error: MYSQL_ROOT_PASSWORD not set. Exiting.
# → Missing environment variable in pod spec
# Get a shell in a crashing container (if it starts up briefly):
kubectl exec -it <pod-name> -- /bin/sh
# Run an equivalent image with a shell to debug:
kubectl run debug --rm -it --image=mysql:8.0 --env="MYSQL_ROOT_PASSWORD=test" -- bash
# Check OOM kills:
kubectl describe pod | grep -i "OOMKilled\|Exit Code"
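The Exit Code that the grep above surfaces carries a hint of its own: by shell convention, codes above 128 mean the process was killed by a signal (128 plus the signal number). The sketch below turns a few common codes into a next step. explain_exit_code is a hypothetical helper, and the wording of each hint is an assumption about typical causes rather than a guaranteed diagnosis.

```shell
# Maps a container exit code to a likely cause.
# Codes above 128 follow the convention 128 + signal number.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check whether the command is meant to keep running" ;;
    1)   echo "general application error; read the previous logs" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found; check the image and the command/args fields" ;;
    137) echo "killed by SIGKILL (128+9); often the OOM killer" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)
      # Non-numeric input makes the test fail quietly and falls through to else.
      if [ "$1" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $(($1 - 128))"
      else
        echo "application-specific exit code"
      fi ;;
  esac
}

# One way to fetch the last exit code of the first container in a pod:
#   kubectl get pod <pod-name> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

A 137 together with Reason: OOMKilled in kubectl describe pod confirms a memory limit problem. A 137 without that reason usually means something else sent SIGKILL, for example a failed liveness probe after the grace period expired.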
Node issues
# Check node health:
kubectl get nodes # Look for NotReady status
kubectl describe node <node-name> # Read Conditions section
# Common conditions and meaning:
# Ready=False or Unknown → kubelet is unhealthy, or the control plane has lost contact with the node
# MemoryPressure=True → Node running low on memory (pods may be evicted)
# DiskPressure=True → Node running low on disk space (check with df -h on the node)
# PIDPressure=True → Node running too many processes
# If DiskPressure is True on a node (very common):
# SSH into the node and find what is filling disk:
df -h
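du -sh /var/lib/docker/ # Docker images and volumes (nodes using the Docker runtime)
du -sh /var/lib/containerd/ # Images and snapshots (nodes using containerd, default path)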
du -sh /var/log/pods/ # Pod log files
kubectl describe node Conditions section
Conditions:
Type Status Message
---- ------ -------
MemoryPressure False kubelet has sufficient memory available
DiskPressure True kubelet has disk pressure ← PROBLEM
PIDPressure False kubelet has sufficient PID available
Ready Unknown Kubelet stopped posting node status.
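Once you are on the node, the df output can be filtered so that only the filesystems close to the limit stand out. The sketch below assumes POSIX df -P output (Capacity in column 5, mount point in column 6). disks_over_threshold is a hypothetical helper, and the default of 85 percent is an arbitrary early-warning value. Unless the cluster overrides it, the kubelet's default hard eviction threshold for the node filesystem is nodefs.available<10%, which is roughly 90 percent used.

```shell
# Reads `df -P` output on stdin and prints "<mount> <use%>" for each
# filesystem whose usage is at or above the threshold (default: 85).
# Mount points containing spaces are not handled.
disks_over_threshold() {
  awk -v t="${1:-85}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)          # strip the percent sign from the Capacity column
    if (use + 0 >= t + 0) print $6, use "%"
  }'
}

# Typical use on the node:
#   df -P | disks_over_threshold 85
```

Run du -sh on whatever mount point it reports to find the directory that is growing.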
Networking problems
# Symptom: pods cannot reach a service
# Step 1: verify the service exists and has endpoints:
kubectl get service my-service
kubectl get endpoints my-service
kubectl get endpoints output — healthy vs broken
# Healthy (has pod IPs):
NAME ENDPOINTS AGE
my-service 10.244.1.15:8080,10.244.1.23:8080 5m
# Broken (no endpoints = no Ready pods match the service selector):
NAME ENDPOINTS AGE
my-service <none> 5m
# Fix: check that pod labels match service selector:
kubectl get pods --show-labels
kubectl describe service my-service | grep Selector
# Test DNS resolution from inside a pod:
kubectl run dns-test --rm -it --image=alpine -- nslookup my-service
kubectl run dns-test --rm -it --image=alpine -- wget -qO- http://my-service:8080
# Check if CNI plugin is working:
kubectl get pods -n kube-system # Flannel/Calico pods should all be Running
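Comparing labels against a selector by eye is error-prone when pods carry many labels. The sketch below checks it mechanically for the simple equality selectors that Services use. selector_matches is a hypothetical helper, and both arguments are assumed to be comma-separated key=value lists in the form printed by kubectl get pods --show-labels and by the Selector line of kubectl describe service.

```shell
# Returns 0 if every key=value pair in the selector appears in the pod labels.
# Usage: selector_matches "<selector>" "<pod-labels>"
selector_matches() {
  # A Service with no selector gets no automatically managed endpoints.
  [ -n "$1" ] || return 1
  local labels=",$2,"
  local pair
  local IFS=','
  for pair in $1; do
    case "$labels" in
      *",$pair,"*) ;;        # this selector pair is present on the pod
      *) return 1 ;;         # a required label is missing or has another value
    esac
  done
  return 0
}

# Example with made-up labels:
#   selector_matches "app=web,tier=frontend" "app=web,tier=frontend,version=v2" \
#     && echo "pod is selected"
```

On a live cluster the same check is available directly: kubectl get pods -l followed by the selector string lists exactly the pods the Service would pick up.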
Conclusion
The Kubernetes troubleshooting toolkit: kubectl describe for events and conditions, kubectl logs --previous for crash diagnostics, kubectl get endpoints for Service routing problems, and kubectl get events --sort-by='.lastTimestamp' for a chronological view of what happened in the cluster. DiskPressure on nodes is one of the most common production problems, so monitor node disk usage and configure log rotation for pod logs (/var/log/pods/) to keep them from filling the disk unnoticed.
FAQ
Is Kubernetes Troubleshooting important for Ubuntu administrators?
Yes. It supports practical Ubuntu administration because it connects directly to server reliability, security, troubleshooting, or daily operations.
Should I practice this on a live server?
Use a lab VM first. After you understand the command output and rollback path, apply the workflow carefully on real systems.
What should I do after reading this article?
Run the practice commands, write down what each one shows, and continue to the next article in the Ubuntu roadmap.