High Availability Concepts

High availability (HA) means designing systems so that a single component failure does not make the whole service unavailable. HA is achieved through redundancy, which eliminates single points of failure. Understanding HA concepts is a prerequisite for designing reliable infrastructure: you need to know which failure modes exist, what the recovery options are, and how availability, consistency, and cost trade off against each other.

HA terminology

| Term | Definition |
|------|------------|
| SPOF | Single Point of Failure: a component whose failure stops the entire service |
| Failover | Automatic switch to a standby component when the primary fails |
| RTO | Recovery Time Objective: how long can you be down? (minutes/hours) |
| RPO | Recovery Point Objective: how much data loss is acceptable? (seconds/hours) |
| Active-passive | Primary handles all traffic; standby takes over only on failure |
| Active-active | Multiple nodes handle traffic simultaneously; any can handle any request |
| Quorum | Majority of nodes must agree before a change is committed (prevents split-brain) |
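
The quorum rule reduces to simple integer arithmetic. As a minimal sketch (the function names quorum_size and tolerated_failures are illustrative and do not come from any tool), a cluster of N voting nodes needs floor(N/2) + 1 votes, and it can lose the remaining nodes while still committing changes:

```shell
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names): majority quorum arithmetic.

# Smallest number of nodes that forms a strict majority of $1 nodes.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# How many nodes can fail while a majority is still reachable.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 2 3 4 5; do
    echo "nodes=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

A two-node cluster tolerates no failures, and four nodes tolerate no more failures than three, which is why odd cluster sizes of three or more are the usual recommendation.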

Common HA patterns

Active-Passive (simple, low cost):
  Primary ──→ serves traffic
  Standby ──→ idle, ready to take over
  Failover time: 30s-5min (VIP/DNS change)
  Cost: 2x infrastructure

Active-Active (complex, better performance):
  Node 1 ──→ serves requests
  Node 2 ──→ serves requests simultaneously
  Failover time: seconds (load balancer removes failed node)
  Challenge: data consistency between nodes

  Examples: Nginx/Apache (stateless web servers — easy)
            MySQL Galera Cluster (synchronous multi-master)
            PostgreSQL BDR (bi-directional replication)
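
A rough model shows why both patterns raise availability. If each node is up a fraction a of the time and nodes fail independently (a strong assumption that shared power, networks, or software bugs often violate), at least one of n nodes is up with probability 1 - (1 - a)^n. The sketch below uses awk for the floating-point math; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Illustrative helper (hypothetical name): availability of n redundant nodes,
# assuming independent failures and perfect, instant failover.
# Usage: combined_availability <per-node availability in percent> <node count>
combined_availability() {
    LC_ALL=C awk -v a="$1" -v n="$2" 'BEGIN {
        down = 1 - a / 100            # probability that a single node is down
        printf "%.4f\n", (1 - down ^ n) * 100
    }'
}

combined_availability 99 1   # single node
combined_availability 99 2   # redundant pair
combined_availability 99 3   # three nodes
```

Real systems land below these figures because failover itself takes time and failures are rarely fully independent.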

Database HA

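# MySQL HA with Orchestrator (automated failover):
# Orchestrator monitors the replication topology and promotes a replica
# when the primary fails. It does not rewrite application connection strings;
# clients follow the new primary through a proxy, VIP, or DNS record,
# typically updated from Orchestrator's failover hooks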

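# MySQL Galera Cluster (active-active, synchronous):
# All nodes can accept writes; each transaction is replicated to and certified by the cluster at commit
# Use at least 3 nodes so a majority (quorum) survives one node failure
# Add to mysqld.cnf (a working node also needs wsrep_provider and binlog_format = ROW):
# wsrep_on = ON
# wsrep_cluster_address = gcomm://node1,node2,node3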

# PostgreSQL HA with Patroni:
# Patroni uses etcd/Consul/ZooKeeper for distributed consensus
# Automatic failover, typically within about 30s (depends on the ttl and loop_wait settings)
sudo apt install -y patroni
# Configuration in /etc/patroni/config.yml

# Simple active-passive with keepalived (VIP failover):
sudo apt install -y keepalived
sudo nano /etc/keepalived/keepalived.conf

keepalived.conf for the primary node (the standby uses state BACKUP and a lower priority)

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100             # Higher priority = preferred master
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass secretpass # Replace with your own; keepalived uses only the first 8 characters
    }
    virtual_ipaddress {
        192.168.1.100        # Virtual IP that follows the active node
    }
}
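
The standby node needs almost the same file. As a hedged sketch, the helper below derives a standby config from the primary one by switching state to BACKUP and lowering the priority. The function name, the default priority of 90, and the output file name are assumptions made for illustration; any priority below the primary's 100 would do.

```shell
#!/usr/bin/env bash
# Illustrative helper (hypothetical name): print a standby variant of a
# primary keepalived.conf by switching the state and lowering the priority.
# Usage: make_backup_conf <primary config file> [standby priority]
make_backup_conf() {
    local conf="$1"
    local prio="${2:-90}"   # assumed value; must be lower than the primary's priority
    sed -e 's/^\([[:space:]]*state[[:space:]]\{1,\}\)MASTER/\1BACKUP/' \
        -e "s/^\([[:space:]]*priority[[:space:]]\{1,\}\)[0-9]\{1,\}/\1${prio}/" \
        "$conf"
}

# Example (writes a new file and leaves the primary config untouched):
#   make_backup_conf /etc/keepalived/keepalived.conf 90 > keepalived-backup.conf
```

Everything else, including virtual_router_id, the authentication block, and the virtual IP, has to match on both nodes so they recognise each other as members of the same VRRP instance.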

Measuring availability

| Availability | Annual downtime | Description |
|--------------|-----------------|-------------|
| 99% (two nines) | 87.6 hours | Single server, no HA |
| 99.9% (three nines) | 8.76 hours | Basic redundancy |
| 99.99% (four nines) | 52.6 minutes | Full HA with fast failover |
| 99.999% (five nines) | 5.26 minutes | Enterprise, expensive |
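
The downtime figures follow directly from the availability percentage: allowed downtime is (1 - availability) multiplied by the length of the year. A minimal sketch, assuming a 365-day year and using an illustrative function name:

```shell
#!/usr/bin/env bash
# Illustrative helper (hypothetical name): allowed downtime per 365-day year
# for a given availability percentage.
annual_downtime() {
    LC_ALL=C awk -v a="$1" 'BEGIN {
        minutes = (1 - a / 100) * 365 * 24 * 60
        if (minutes >= 60)
            printf "%.2f hours\n", minutes / 60
        else
            printf "%.2f minutes\n", minutes
    }'
}

for a in 99 99.9 99.99 99.999; do
    printf '%s%% -> %s\n' "$a" "$(annual_downtime "$a")"
done
```

The same function can be used to translate an RTO budget into a target: if the business tolerates roughly an hour of downtime per year, the service needs to sit close to four nines.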
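
# Monitor uptime continuously:
sudo apt install -y nagios4    # OR use Prometheus + Alertmanager

# Automatic restart of a failed service, set in the [Service] section of its systemd unit:
# Restart=on-failure    # restart on non-zero exit, fatal signal, timeout, or watchdog expiry
# RestartSec=5s         # wait 5 seconds before restarting
# WatchdogSec=30s       # only for services that send sd_notify "WATCHDOG=1" keepalives;
#                       # systemd terminates the service if none arrives within 30 seconds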

Conclusion

Start by identifying every SPOF in your architecture: single database server, single web server, single network switch. Each one is a component whose failure causes an outage. Fix the most impactful SPOFs first based on your RTO and RPO requirements. For databases, replication + automated failover (Orchestrator for MySQL, Patroni for PostgreSQL) is the practical path to 99.9%+ availability. For stateless web servers, a load balancer with health checks provides automatic failover in seconds.

FAQ

Are high availability concepts important for Ubuntu administrators?

Yes. They underpin server reliability: knowing where the single points of failure are and how failover works shapes troubleshooting and daily operations.

Should I practice this on a live server?

Use a lab VM first. After you understand the command output and rollback path, apply the workflow carefully on real systems.

What should I do after reading this article?

Run the practice commands, write down what each one shows, and continue to the next article in the Ubuntu roadmap.
