Complete Linux Troubleshooting Methodology

Effective Linux troubleshooting is a skill that combines systematic process with pattern recognition. Experienced sysadmins work faster not because they know more commands, but because they follow a disciplined diagnostic process: gather data before acting, understand the problem completely before proposing solutions, and test hypotheses one at a time. The most common troubleshooting mistake is applying fixes based on assumptions, which wastes time and can make the problem worse.

The troubleshooting methodology

6-step troubleshooting process:

  1. Define the problem precisely
     "The website is slow" is not a problem statement.
     "Response time increased from 200ms to 4s on /api/search at 14:30"
     is a problem statement. When did it start? What changed?

  2. Gather data (BEFORE changing anything)
     Logs, metrics, command output — capture the evidence

  3. Form hypotheses (based on data, not assumptions)
     "High CPU on mysqld + slow query log shows a missing index"

  4. Test ONE hypothesis at a time
     Adding an index tests one hypothesis without creating new variables

  5. Verify the fix worked
     Measure before and after. Response time back to 200ms? Log errors gone?

  6. Document
     What was the problem, root cause, fix, and how to prevent recurrence
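
Step 5 is easier to apply with a small helper. The following is a minimal sketch, assuming you have saved one response time in milliseconds per line to a file before and after the change. The function name `avg_ms` and the file names are hypothetical and not part of any standard tool.

```shell
# avg_ms: read one response time (in milliseconds) per line from stdin
# and print the mean, rounded to the nearest millisecond.
# Prints nothing if the input is empty.
avg_ms() {
    awk '{ sum += $1; n++ } END { if (n > 0) printf "%.0f\n", sum / n }'
}
```

Running `avg_ms < before.txt` and `avg_ms < after.txt` gives two numbers to compare against the baseline named in the problem statement.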

Gathering data before acting

# NEVER restart services as the first troubleshooting step.
# Restarts destroy diagnostic evidence, and the problem usually returns because the root cause is untouched.

# The first 5 commands on any sick server:
date               # Confirm the clock is correct (log timestamps are useless if it is wrong)
uptime             # How long running? Load average?
free -h            # Memory pressure?
df -h              # Disk space?
systemctl --failed # Any failed services?
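
# One way to preserve this first look is to save it to a file before touching
# anything. This is a sketch only: the function name capture_baseline and the
# output file naming are assumptions, not an established convention.

```shell
# capture_baseline [DIR]: save the output of the five first-look commands
# into one timestamped file under DIR (default /tmp) and print its path.
capture_baseline() {
    outdir=${1:-/tmp}
    outfile="$outdir/baseline-$(date +%Y%m%d-%H%M%S).txt"
    for cmd in "date" "uptime" "free -h" "df -h" "systemctl --failed"; do
        printf '## %s\n' "$cmd"
        # Word splitting of $cmd is intentional: each entry is a command plus flags.
        $cmd 2>&1 || printf '(command failed or not available)\n'
        printf '\n'
    done > "$outfile"
    printf '%s\n' "$outfile"
}
```

# Calling capture_baseline /var/tmp at the start of an incident leaves a record
# you can compare against once the fix is in place.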

# Gather application-specific data:

# Web server issues:
sudo tail -100 /var/log/nginx/error.log
sudo tail -100 /var/log/apache2/error.log

# Database issues:
sudo mysqladmin -u root -p status
sudo mysql -u root -p -e "SHOW PROCESSLIST"

# System-wide recent errors:
sudo journalctl -p err --since "1 hour ago"
sudo dmesg | tail -50

# What changed recently (logins, authentication events, package updates):
sudo last | head -20    # Recent logins
sudo tail -50 /var/log/auth.log    # Authentication events
sudo tail -50 /var/log/dpkg.log    # Package install/remove history
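
# Configuration edits are another common "what changed" that the logs above do
# not show. A minimal sketch using find; the function name recently_changed is
# hypothetical, and the one-day default is an assumption to adjust as needed.

```shell
# recently_changed DIR [DAYS]: list regular files under DIR whose contents
# were modified within the last DAYS days (default 1).
# Errors from unreadable subdirectories are hidden.
recently_changed() {
    dir=$1
    days=${2:-1}
    find "$dir" -type f -mtime "-$days" 2>/dev/null
}
```

# For example, sudo sh -c with recently_changed /etc 2 would list files under
# /etc modified in the last two days.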

Essential diagnostic commands

Category	Command	What it shows
Processes	ps aux --sort=-%cpu	Top CPU consumers
Processes	ps aux --sort=-%mem	Top memory consumers
Memory	free -h	RAM and swap usage
Memory	vmstat 1 5	CPU vs I/O wait, swap activity
Disk	df -h	Filesystem usage
Disk	iostat -x 1	Disk I/O throughput and utilization
Network	ss -tlnp	Listening TCP ports and processes
Network	netstat -s	Network statistics (errors, drops)
Logs	journalctl -p err	Error-level systemd journal entries
Logs	dmesg -T	Kernel messages with timestamps
Services	systemctl --failed	All failed systemd units
Files	lsof +D /path	Open files in a directory
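
The output of these commands can also be filtered rather than read by eye. As one illustration, this sketch reports filesystems above a usage threshold from `df -P` output. The function name `fs_over` is hypothetical, and the parsing assumes mount points without spaces.

```shell
# fs_over LIMIT: read `df -P` output on stdin and print the mount point and
# usage of every filesystem at or above LIMIT percent.
fs_over() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # "95%" -> "95"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

Typical use would be `df -P | fs_over 90`. Reading from stdin keeps the function easy to test against saved output.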

# The USE method (Utilization, Saturation, Errors) — systematic resource check:
# For each resource (CPU, memory, disk, network), check:

# Utilization (how busy is it?):
top    # CPU utilization
free -h    # Memory utilization
df -h    # Disk utilization

# Saturation (is there a queue of waiting work?):
uptime    # CPU: load average > number of CPUs = saturated
vmstat 1 5 | awk 'NR > 2 {print $7, $8}'    # Memory: sustained si/so (swap-in/swap-out) > 0 = saturated
iostat -x 1 2 | awk '{print $1, $NF}'    # Disk: %util > 80% = saturated (first report is the since-boot average)

# Errors (are there failures happening?):
sudo dmesg | grep -i "error\|fail"
sudo journalctl -p err --since "1 hour ago"
netstat -s | grep -i "error\|fail"
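
# The CPU saturation rule above (load average > number of CPUs) can be written
# as a check. This is a sketch; the function name cpu_saturated is hypothetical.

```shell
# cpu_saturated LOAD NCPU: succeed (exit 0) when the load average exceeds the
# CPU count. awk does the comparison because load averages are fractional
# and POSIX shell arithmetic is integer-only.
cpu_saturated() {
    awk -v loadavg="$1" -v ncpu="$2" 'BEGIN { exit !(loadavg + 0 > ncpu + 0) }'
}
```

# On Linux the inputs can come from /proc/loadavg and nproc:
#   cpu_saturated "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" && echo "CPU saturated"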

When to escalate

# Know when to stop and escalate rather than keep guessing:

# Escalate when:
# - Data gathering shows nothing abnormal but the problem persists
# - You have tested 3+ hypotheses with no change in symptoms
# - The problem is in a system you do not administer (upstream provider, CDN, DNS)
# - Data points to hardware failure (disk SMART errors, memory ECC errors, thermal)
# - You cannot explain the root cause from the evidence

# Hardware diagnosis:
sudo smartctl -a /dev/sda    # Disk health (SMART data)
sudo edac-util -v            # Memory ECC error counts (needs EDAC kernel drivers)
sudo ipmitool sdr            # IPMI sensor readings (temperature, voltage)

# Post-incident documentation template:
# Incident: nginx 502 errors affecting 30% of requests
# Duration: 14:30 - 15:05 (35 minutes)
# Impact: ~30% of API requests failed with 502
# Root cause: MySQL slow queries caused PHP-FPM pool exhaustion
#   → Deployment at 14:25 introduced a query without an index on orders table
#   → Query time: 0.2ms → 8 seconds on table with 2M rows
# Fix: Added index: ALTER TABLE orders ADD INDEX idx_customer (customer_id)
# Verify: Response time returned to 180ms average (from 4200ms peak)
# Prevention:
#   1. Added EXPLAIN check to CI pipeline for new queries
#   2. Added MySQL slow query alert (>1s) to monitoring
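
# The Duration line in the template is simple arithmetic that is easy to get
# wrong under pressure. A small sketch, assuming both times are HH:MM on the
# same day; the function name duration_min is hypothetical.

```shell
# duration_min START END: print the minutes between two HH:MM times.
# awk converts "08" to 8, which avoids the octal trap in shell arithmetic.
duration_min() {
    awk -v s="$1" -v e="$2" 'BEGIN {
        split(s, a, ":")
        split(e, b, ":")
        print (b[1] * 60 + b[2]) - (a[1] * 60 + a[2])
    }'
}
```

# duration_min 14:30 15:05 prints 35, matching the example incident above.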

Conclusion

The single most valuable troubleshooting habit is writing down what you observe before you start making changes. Even a few lines in a terminal window or notes app — "14:35: high CPU on mysqld, load 12.4, slow query log shows orders table query at 8s" — organizes your thinking, prevents you from forgetting what you already checked, and becomes the foundation of the post-incident documentation. Troubleshooting without notes is troubleshooting with a short memory in a high-stress situation: the worst combination.
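
A shell function can make this habit nearly free. The following is a minimal sketch: the function name `note`, the `NOTES_FILE` variable, and the default file path are all hypothetical choices, not a standard tool.

```shell
# note TEXT...: append a timestamped line to the notes file.
# NOTES_FILE overrides the default location in $HOME.
note() {
    printf '%s: %s\n' "$(date +%H:%M)" "$*" >> "${NOTES_FILE:-$HOME/incident-notes.txt}"
}
```

Typing `note high CPU on mysqld, load 12.4` in the same terminal you are troubleshooting in records the observation with the time it was made.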

FAQ

Why should administrators understand Complete Linux Troubleshooting Methodology?

Because this topic affects planning decisions, server lifecycle, compatibility, support expectations, or how you reason about Ubuntu systems before making operational changes.

Do I need a lab for this topic?

A lab is useful for checking commands and seeing the concept on a real Ubuntu machine, but the main value is understanding the decision, tradeoff, or system behavior clearly.

How should I use this knowledge in production?

Use it to make better choices, document why those choices were made, and avoid rushed changes that ignore support windows, compatibility, stability, or operational risk.

Need help with Ubuntu administration?

Work directly with Muhammad Irfan Aslam for Ubuntu Server, Linux, cloud, Docker, DevOps, CI/CD, or infrastructure troubleshooting support.
