Disaster Recovery Planning

Disaster recovery planning is the process of deciding in advance how you will respond to different types of failures. Without a plan, every incident is a chaotic emergency where you make decisions under pressure with incomplete information. With a plan, most incidents become execution of a known procedure. The key deliverable is a runbook: a step-by-step document that any qualified sysadmin can follow to recover the system, even if they have never seen it before.

RTO and RPO

Recovery planning requires two decisions before any incident:

RPO (Recovery Point Objective):
  "How much data loss is acceptable?"
  RPO = 24h → daily backups are sufficient
  RPO = 1h  → hourly backups required
  RPO = 0   → synchronous replication required (no async delay)

RTO (Recovery Time Objective):
  "How long can the service be down?"
  RTO = 24h → manual recovery from backup is acceptable
  RTO = 1h  → automated failover or warm standby required
  RTO = 5m  → hot standby with automatic failover required

These decisions determine your architecture cost:
  Long RPO/RTO = cheap (basic backups)
  Short RPO/RTO = expensive (replication, hot standbys, fast failover)
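
An RPO is only met if the newest backup is actually younger than the target. The following is a minimal sketch of that check, assuming backups are plain files in a single directory tree and that GNU find is available (as on Ubuntu). The function name `backup_within_rpo` and its arguments are illustrative, not part of any standard tool.

```shell
# backup_within_rpo DIR RPO_SECONDS
# Succeeds (exit 0) if the newest regular file under DIR (searched
# recursively) was modified within the last RPO_SECONDS; fails otherwise.
# Hypothetical helper for illustration only.
backup_within_rpo() {
    local dir="$1"
    local rpo_seconds="$2"
    local newest now age

    # Modification time (epoch seconds) of the newest file under DIR.
    newest=$(find "$dir" -type f -printf '%T@\n' 2>/dev/null | sort -n | tail -n 1)
    if [ -z "$newest" ]; then
        echo "no backups found in $dir" >&2
        return 1
    fi

    now=$(date +%s)
    age=$(( now - ${newest%.*} ))   # strip the fractional seconds
    echo "newest backup is ${age}s old (RPO ${rpo_seconds}s)"
    [ "$age" -le "$rpo_seconds" ]
}
```

For an RPO of 24h, a monitoring job might run `backup_within_rpo /backup/mysql 86400` and alert on a non-zero exit status.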

Common failure scenarios

| Scenario | Detection | Recovery action |
| --- | --- | --- |
| Disk failure | Disk I/O errors in dmesg, SMART alerts | Replace disk, restore from backup |
| Database corruption | Application errors, db error log | Restore from last clean backup |
| Accidental file deletion | Application 404s, user reports | Restore specific files from backup snapshot |
| Server compromise | IDS alert, unusual process, modified files | Isolate, forensics, rebuild from known-good |
| Datacenter outage | All monitoring goes dark | Failover to alternate region |
| Ransomware | Files encrypted, ransom note | Restore from offline backup (pre-infection) |
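
As one example of turning a detection signal into a check, the sketch below counts kernel log lines that mention an I/O error, which is the disk failure signal listed above. The helper name `count_io_errors` is hypothetical, and matching on the phrase "I/O error" is a simplifying assumption: the exact kernel message wording varies between kernel versions and drivers.

```shell
# count_io_errors: read kernel log text on stdin and print the number of
# lines that mention an I/O error (case-insensitive).
# Hypothetical helper; the match pattern is an assumption.
count_io_errors() {
    # grep -c prints the count even when it is 0, but then exits non-zero,
    # so its exit status is deliberately ignored here.
    grep -ci 'i/o error' || true
}
```

Typical usage would be `sudo dmesg | count_io_errors` or `journalctl -k | count_io_errors`. A non-zero count is a prompt to investigate further, for example with `sudo smartctl -H /dev/sda` from the smartmontools package (the device name is only an example).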

Creating a runbook

# A runbook documents the step-by-step recovery procedure
# Store it in: version control, wiki, AND a printed copy offline

# Minimal runbook template for a web application:
# 1. ASSESSMENT
#    - What failed? (web server, database, networking, full server)
#    - Is data intact? (check backup age and integrity)
#    - What is the business impact? (who needs to be notified)

# 2. RECOVERY STEPS (example for database failure)
#    a. Provision new database server (or use standby)
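#    b. Install MySQL: sudo apt update && sudo apt install mysql-server
#    c. Copy latest backup from /backup/mysql/
#    d. Restore: sudo mysql < latest-backup.sql
#       (append the database name if the dump covers a single database
#        and contains no CREATE DATABASE statement)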
#    e. Verify: check row counts, test application connection
#    f. Update application config with new DB host
#    g. Test end-to-end application functionality

# 3. VERIFICATION
#    - List specific checks that confirm recovery is complete
#    - Include expected output for key commands

# 4. ROLLBACK
#    - If recovery fails, what's the fallback?
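
The assessment step asks whether the backup is intact before anything is restored from it. One way to make that check concrete is to record a checksum when the backup is written and verify it before the restore. This is a minimal sketch assuming GNU coreutils `sha256sum`; the helper names `write_checksum` and `verify_backup` and the `.sha256` sidecar file are illustrative choices, not an established convention of any backup tool.

```shell
# write_checksum FILE
# Run at backup time: store FILE's SHA-256 checksum in FILE.sha256.
write_checksum() {
    ( cd "$(dirname "$1")" &&
      sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# verify_backup FILE
# Run before restoring: succeeds only if FILE still matches FILE.sha256.
# Fails if the file changed or the checksum file is missing.
verify_backup() {
    ( cd "$(dirname "$1")" &&
      sha256sum --check --quiet "$(basename "$1").sha256" )
}
```

In the runbook, step d could then become `verify_backup /backup/mysql/latest-backup.sql && sudo mysql < /backup/mysql/latest-backup.sql`, so a corrupted dump is caught before it is loaded. A matching checksum only shows that the file is unchanged since the backup was taken; it does not prove the dump itself was complete.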

DR testing

# DR testing schedule:
# Quarterly: full recovery drill (restore everything from backup to a test environment)
# Monthly: backup restore test (verify backups are restorable)
# Weekly: review and update runbooks

# Measure actual recovery times during drills:
start=$(date +%s); echo "Starting recovery at $(date)"
# ... recovery steps ...
end=$(date +%s); echo "Recovery complete in $(( end - start )) seconds"

# Common discoveries during a first DR drill:
# - Backup is older than expected (cron job failed silently)
# - Recovery procedure depends on manual steps or information that was never written down
# - Some config files were not included in backup scope
# These discoveries are valuable: better to find them during a drill than during a real incident
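
The timing idea above can be taken one step further by comparing the measured recovery time against the RTO. The following is a rough sketch, assuming the recovery procedure can be invoked as a single command or script. The function name `timed_drill` is hypothetical, and one-second resolution is assumed to be precise enough for a drill.

```shell
# timed_drill RTO_SECONDS COMMAND [ARGS...]
# Runs COMMAND, reports how long it took, and succeeds only if COMMAND
# succeeded and finished within RTO_SECONDS. Hypothetical helper.
timed_drill() {
    local rto_seconds="$1"
    shift
    local start end elapsed status

    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)

    elapsed=$(( end - start ))
    echo "recovery command exited with status $status after ${elapsed}s (RTO ${rto_seconds}s)"
    [ "$status" -eq 0 ] && [ "$elapsed" -le "$rto_seconds" ]
}
```

For an RTO of 1h, a drill might be run as `timed_drill 3600 ./restore-database.sh`, where `restore-database.sh` stands in for whatever script implements your runbook steps. Recording the printed duration after each drill shows whether recovery time is drifting toward the RTO.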

Conclusion

Write your runbook before you need it. At minimum, document: where backups are stored and how to access them, the step-by-step restore procedure for each critical system, who to notify during an incident, and how to verify that recovery is complete. Test the runbook during a planned drill; the first time you follow it should not be during a 3am emergency. Every failure that your DR plan does not cover is a single point of failure (SPOF) in your planning, not just in your infrastructure.

FAQ

Is Disaster Recovery Planning important for Ubuntu administrators?

Yes. Backups only help if someone can restore from them under pressure. Agreed RPO and RTO targets plus a tested runbook turn an outage from an improvised emergency into the execution of a known procedure.

Should I practice this on a live server?

Use a lab VM first. After you understand the command output and rollback path, apply the workflow carefully on real systems.

What should I do after reading this article?

Run the practice commands, write down what each one shows, and continue to the next article in the Ubuntu roadmap.
