Disaster Recovery Planning

Disaster recovery planning is the process of deciding in advance how you will respond to different types of failures. Without a plan, every incident is a chaotic emergency where you make decisions under pressure with incomplete information. With a plan, most incidents become execution of a known procedure. The key deliverable is a runbook: a step-by-step document that any qualified sysadmin can follow to recover the system, even if they have never seen it before.

RTO and RPO

Recovery planning requires two decisions before any incident:

RPO (Recovery Point Objective):
  "How much data loss is acceptable?"
  RPO = 24h → daily backups are sufficient
  RPO = 1h  → hourly backups required
  RPO = 0   → synchronous replication required (no async delay)

RTO (Recovery Time Objective):
  "How long can the service be down?"
  RTO = 24h → manual recovery from backup is acceptable
  RTO = 1h  → automated failover or warm standby required
  RTO = 5m  → hot standby with automatic failover required

These decisions determine your architecture cost:
  Long RPO/RTO = cheap (basic backups)
  Short RPO/RTO = expensive (replication, hot standbys, fast failover)
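
An RPO only holds if the newest backup is actually younger than the objective. The following is a minimal sketch of that check, assuming backups land as regular files in a single directory and that GNU find is available (as on Ubuntu). The function names and the example RPO value are illustrative, not part of any standard tool.

```shell
#!/usr/bin/env bash
# Sketch: compare the age of the newest backup file against an RPO in hours.
# newest_backup_mtime and rpo_met are hypothetical helper names.

# Print the modification time (epoch seconds) of the newest regular file in a directory.
newest_backup_mtime() {
    find "$1" -maxdepth 1 -type f -printf '%T@\n' 2>/dev/null \
        | sort -n | tail -n 1 | cut -d. -f1
}

# Usage: rpo_met BACKUP_DIR RPO_HOURS
# Returns 0 if the newest backup is younger than the RPO, 1 if not, 2 if no backup exists.
rpo_met() {
    local dir="$1" rpo_hours="$2" mtime now age_hours
    mtime=$(newest_backup_mtime "$dir")
    if [ -z "$mtime" ]; then
        echo "NO BACKUP FOUND in $dir"
        return 2
    fi
    now=$(date +%s)
    age_hours=$(( (now - mtime) / 3600 ))
    if [ "$age_hours" -lt "$rpo_hours" ]; then
        echo "OK: newest backup is ${age_hours}h old (RPO ${rpo_hours}h)"
    else
        echo "VIOLATION: newest backup is ${age_hours}h old (RPO ${rpo_hours}h)"
        return 1
    fi
}
```

For example, `rpo_met /backup/mysql 24` could run from cron or a monitoring check, so a missed daily backup shows up the next morning rather than during a restore.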

Common failure scenarios

Scenario	Detection	Recovery action
Disk failure	Disk I/O errors in dmesg, SMART alerts	Replace disk, restore from backup
Database corruption	Application errors, db error log	Restore from last clean backup
Accidental file deletion	Application 404s, user reports	Restore specific files from backup snapshot
Server compromise	IDS alert, unusual process, modified files	Isolate, forensics, rebuild from known-good
Datacenter outage	All monitoring goes dark	Failover to alternate region
Ransomware	Files encrypted, ransom note	Restore from offline backup (pre-infection)
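
As a rough illustration of the disk failure row, the sketch below counts kernel log lines that look like disk I/O errors. The match patterns are an assumption about typical kernel messages, not an exhaustive list, and `count_io_errors` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: count lines on stdin that look like disk I/O errors.
# The patterns are assumptions; extend them for your hardware and drivers.
count_io_errors() {
    # grep -c prints the number of matching lines; it exits 1 when the
    # count is 0, so "|| true" keeps that case from looking like a failure.
    grep -Eci 'i/o error|medium error' || true
}
```

A possible use is `sudo dmesg | count_io_errors` (reading the kernel log may require root on Ubuntu). For SMART status, `sudo smartctl -H /dev/sda` from the smartmontools package reports overall drive health; the device name here is a placeholder.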

Creating a runbook

# A runbook documents the step-by-step recovery procedure
# Store it in: version control, wiki, AND a printed copy offline

# Minimal runbook template for a web application:
# 1. ASSESSMENT
#    - What failed? (web server, database, networking, full server)
#    - Is data intact? (check backup age and integrity)
#    - What is the business impact? (who needs to be notified)

# 2. RECOVERY STEPS (example for database failure)
#    a. Provision new database server (or use standby)
#    b. Install MySQL: sudo apt update && sudo apt install mysql-server
#    c. Copy latest backup from /backup/mysql/
#    d. Restore: mysql < latest-backup.sql
#    e. Verify: check row counts, test application connection
#    f. Update application config with new DB host
#    g. Test end-to-end application functionality

# 3. VERIFICATION
#    - List specific checks that confirm recovery is complete
#    - Include expected output for key commands

# 4. ROLLBACK
#    - If recovery fails, what's the fallback?
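
The assessment step asks whether the backup is intact. One way to check that before restoring is sketched below. It assumes, hypothetically, that the backup job wrote a checksum file next to each dump by running `sha256sum latest-backup.sql > latest-backup.sql.sha256` in the backup directory. `verify_backup` is an illustrative helper name.

```shell
#!/usr/bin/env bash
# Sketch: verify a dump against a checksum recorded at backup time.
# Assumes FILE.sha256 exists next to FILE and names FILE by its basename.
verify_backup() {
    local file="$1"
    if [ ! -s "$file" ]; then
        echo "FAIL: $file is missing or empty"
        return 1
    fi
    if [ ! -f "$file.sha256" ]; then
        echo "FAIL: no checksum file for $file"
        return 1
    fi
    # sha256sum -c re-hashes the file named inside the checksum file;
    # --status suppresses output and reports the result via the exit code.
    if (cd "$(dirname "$file")" && sha256sum -c --status "$(basename "$file").sha256"); then
        echo "OK: $file matches its recorded checksum"
    else
        echo "FAIL: $file does not match its recorded checksum"
        return 1
    fi
}
```

In a runbook this could gate the restore step, for example `verify_backup /backup/mysql/latest-backup.sql && mysql < /backup/mysql/latest-backup.sql`, so a truncated or corrupted dump is caught before it is loaded.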

DR testing

# DR testing schedule:
# Quarterly: full recovery drill (restore everything from backup to a test environment)
# Monthly: backup restore test (verify backups are restorable)
# Weekly: review and update runbooks

# Measure actual recovery times during drills:
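start=$(date +%s); echo "Recovery started: $(date)"
# ... recovery steps ...
end=$(date +%s); echo "Recovery complete: $(date)"
# Elapsed minutes, to compare against your RTO
echo "Elapsed: $(( (end - start) / 60 )) min"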

# Common discoveries during a first DR drill:
# - Backup is older than expected (cron job failed silently)
# - Recovery procedure depends on information that is not documented
# - Some config files were not included in backup scope
# These discoveries are valuable: better to make them during a drill than during a real incident
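
The backup scope problem can be checked mechanically. The sketch below lists which required files are absent from a tar archive. It assumes backups are tar archives created with GNU tar and that the required paths are individual files. `missing_from_backup` and the example paths are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report required files that are missing from a tar archive.
# Usage: missing_from_backup ARCHIVE PATH...
# Returns 0 if every path is present, 1 if any is missing, 2 if the archive is unreadable.
missing_from_backup() {
    local archive="$1" listing path missing=0
    shift
    # tar -tf lists member names without extracting; GNU tar detects compression on read.
    listing=$(tar -tf "$archive") || return 2
    for path in "$@"; do
        # GNU tar stores member names without the leading slash by default.
        if ! printf '%s\n' "$listing" | grep -xF -- "${path#/}" >/dev/null; then
            echo "MISSING: $path"
            missing=1
        fi
    done
    return "$missing"
}
```

A drill could call it with the config files the runbook depends on, for example `missing_from_backup /backup/etc.tar.gz /etc/mysql/my.cnf /etc/nginx/nginx.conf`, where the archive name and paths are placeholders for your own.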

Conclusion

Write your runbook before you need it. At minimum, document: where backups are stored and how to access them, the step-by-step restore procedure for each critical system, who to notify during an incident, and how to verify that recovery is complete. Test the runbook during a planned drill; the first time you follow it should not be during a 3am emergency. Every failure scenario that your DR plan does not cover is a single point of failure (SPOF) in your planning, not just in your infrastructure.

FAQ

Is Disaster Recovery Planning important for Ubuntu administrators?

Yes. A tested recovery plan determines how much data an Ubuntu server loses and how long it stays down after a disk failure, compromise, or outage.

Should I practice this on a live server?

Use a lab VM first. After you understand the command output and rollback path, apply the workflow carefully on real systems.

What should I do after reading this article?

Run the practice commands, write down what each one shows, and continue to the next article in the Ubuntu roadmap.
