Enterprise Monitoring Stack

An enterprise monitoring stack provides observability across three signals: metrics (what is happening now), logs (what happened and when), and alerts (notification that something is wrong). A common open-source choice is Prometheus with Grafana for metrics and alerting, combined with a centralized logging system (ELK or Loki). Without comprehensive monitoring, you discover problems when users report them; with it, you can often catch problems before users notice.

Monitoring architecture

Enterprise monitoring architecture:

  Each server (Ubuntu node):
    Node Exporter    → system metrics (CPU, memory, disk, network)
    Application exporters → nginx, mysql, postgres metrics
    Promtail/Filebeat → log shipping to central log store

  Monitoring Server:
    Prometheus       → scrapes metrics, stores time-series data
    Alertmanager     → routes alerts to PagerDuty/Slack/email
    Grafana          → dashboards, visualization
    Loki or ELK      → log storage and search

  On-call:
    PagerDuty/OpsGenie → alert routing, escalation, on-call schedules
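
To make the layout above concrete, the sketch below maps each component to the port and health endpoint it exposes in a default install. The function name health_url and the host names in the usage lines are illustrative assumptions; the ports are the upstream defaults and may differ in your deployment.

```shell
#!/bin/sh
# health_url COMPONENT HOST: print the URL to probe for a component,
# assuming upstream default ports (adjust if your install overrides them).
health_url() {
  case "$1" in
    node_exporter) echo "http://$2:9100/metrics" ;;
    prometheus)    echo "http://$2:9090/-/healthy" ;;
    alertmanager)  echo "http://$2:9093/-/healthy" ;;
    grafana)       echo "http://$2:3000/api/health" ;;
    loki)          echo "http://$2:3100/ready" ;;
    *)             echo "unknown component: $1" >&2; return 1 ;;
  esac
}

# Usage (hypothetical host names):
#   curl -fsS "$(health_url prometheus localhost)"
#   curl -fsS "$(health_url node_exporter web-01)" | head -n 5
```

Probing these URLs from the monitoring server is a quick way to tell a network or firewall problem apart from a misconfigured scrape job.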

Prometheus + Alertmanager

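# Install Node Exporter on all monitored servers.
# wget does not expand wildcards in URLs, so pin a release (1.8.2 is only an example; check the releases page):
VERSION=1.8.2
wget "https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.linux-amd64.tar.gz"
tar xzf "node_exporter-${VERSION}.linux-amd64.tar.gz"
sudo mv "node_exporter-${VERSION}.linux-amd64/node_exporter" /usr/local/bin/
sudo useradd -rs /bin/false node_exporter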

# Create systemd service:
sudo tee /etc/systemd/system/node-exporter.service > /dev/null << 'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable --now node-exporter
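
Before adding the host to Prometheus, it helps to confirm the exporter is actually serving metrics. This is a minimal sketch, assuming the exporter listens on its default port 9100; has_metric is a hypothetical helper, not part of Node Exporter.

```shell
#!/bin/sh
# has_metric NAME: succeed if the Prometheus text-format input on stdin
# contains at least one sample line for metric NAME.
# Comment lines (# HELP, # TYPE) do not count as samples.
has_metric() {
  grep -Eq "^$1(\{| )"
}

# Usage on the monitored server:
#   systemctl is-active node-exporter
#   curl -fsS http://localhost:9100/metrics | has_metric node_load1 && echo "exporter OK"
```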

# On the Prometheus server — add targets:
sudo nano /etc/prometheus/prometheus.yml

/etc/prometheus/prometheus.yml — scrape configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - 'web-01:9100'
          - 'web-02:9100'
          - 'db-01:9100'
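    relabel_configs:
      # Copies the scrape address into the instance label. Prometheus already does this by default, so the rule only makes it explicit.
      - source_labels: [__address__]
        target_label: instance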
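
A syntax error in prometheus.yml stops Prometheus from loading the new configuration, so it is worth validating the file before reloading. The sketch below assumes promtool (shipped with Prometheus) is on the PATH and that the service is named prometheus with reload support in its unit; check_and_reload is a hypothetical wrapper.

```shell
#!/bin/sh
# check_and_reload FILE: validate a Prometheus config with promtool and
# reload the service only if validation passes.
check_and_reload() {
  if promtool check config "$1"; then
    sudo systemctl reload prometheus
  else
    echo "config check failed; not reloading" >&2
    return 1
  fi
}

# Usage:
#   check_and_reload /etc/prometheus/prometheus.yml
#   # Then confirm the new targets are listed:
#   curl -fsS http://localhost:9090/api/v1/targets
```

If the unit has no reload support, sudo systemctl restart prometheus is the blunter alternative.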

# Define critical alerts:
sudo nano /etc/prometheus/rules/system-alerts.yml

/etc/prometheus/rules/system-alerts.yml

groups:
  - name: system
    rules:
      - alert: HighDiskUsage
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|squashfs"} / node_filesystem_size_bytes{fstype!~"tmpfs|squashfs"}) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 85% on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Memory usage above 90% on {{ $labels.instance }}"

      - alert: NodeDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is unreachable"
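
The disk and memory expressions above are simple ratios. As a worked example with made-up numbers, a 100 GB filesystem with 10 GB available has a usage ratio of 1 - 10/100 = 0.90, which exceeds the 0.85 threshold. The sketch below reproduces that arithmetic locally; usage_ratio and over_threshold are illustrative helpers, not Prometheus features.

```shell
#!/bin/sh
# usage_ratio AVAIL TOTAL: print 1 - AVAIL/TOTAL with two decimals,
# mirroring the PromQL expression (1 - avail / size).
usage_ratio() {
  awk -v a="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 1 - a / t }'
}

# over_threshold AVAIL TOTAL LIMIT: succeed when the usage ratio is above LIMIT,
# i.e. when the alert condition would be true for this sample.
over_threshold() {
  awk -v a="$1" -v t="$2" -v l="$3" 'BEGIN { exit !((1 - a / t) > l) }'
}

usage_ratio 10 100    # 10 GB free of 100 GB
over_threshold 10 100 0.85 && echo "HighDiskUsage condition is true"
```

In Prometheus the condition must also stay true for the whole for: window (5m for the disk and memory rules) before the alert fires. On the Prometheus server, promtool check rules /etc/prometheus/rules/system-alerts.yml validates the rule file itself.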

Centralized log aggregation

# Grafana Loki stack (lightweight, integrates with Grafana):
# Promtail on each server ships logs to central Loki instance

# Install Promtail on Ubuntu servers:
wget https://github.com/grafana/loki/releases/latest/download/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip   # needs the unzip package: sudo apt install unzip
sudo install -m 0755 promtail-linux-amd64 /usr/local/bin/promtail

# Promtail config:
# /etc/promtail/config.yml:
# clients:
#   - url: http://loki-server:3100/loki/api/v1/push
# scrape_configs:
#   - job_name: syslog
#     static_configs:
#       - targets: [localhost]
#         labels:
#           job: syslog
#           host: web-01
#           __path__: /var/log/syslog
#   - job_name: nginx
#     static_configs:
#       - targets: [localhost]
#         labels:
#           job: nginx
#           __path__: /var/log/nginx/*.log
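
The steps above install the Promtail binary but do not start it. One way to run it, mirroring the Node Exporter unit earlier, is a systemd service. This is a sketch under assumptions: the unit name promtail.service is my choice, the config path is the one used in the comments above, and the service runs as root so it can read files under /var/log.

```shell
#!/bin/sh
# promtail_unit: print a minimal systemd unit for Promtail to stdout.
# No User= line is set, so the service runs as root (an assumption made
# here so that /var/log/syslog and the nginx logs are readable).
promtail_unit() {
  cat << 'EOF'
[Unit]
Description=Promtail log shipper
After=network.target

[Service]
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Usage:
#   promtail_unit | sudo tee /etc/systemd/system/promtail.service > /dev/null
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now promtail
```

Printing the unit to stdout first makes it easy to review or diff before anything is written under /etc.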

Alerting and runbooks

# Alertmanager routes alerts:
# /etc/alertmanager/alertmanager.yml:
# route:
#   receiver: 'slack-notifications'
#   group_by: ['alertname', 'instance']
#   group_wait: 30s
#   group_interval: 5m
#   repeat_interval: 4h
#   routes:
#     - match:
#         severity: critical
#       receiver: 'pagerduty'
#
# receivers:
#   - name: slack-notifications
#     slack_configs:
#       - api_url: 'https://hooks.slack.com/services/...'
#   - name: pagerduty
#     pagerduty_configs:
#       - routing_key: '...'   # PagerDuty Events API v2 integration key
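
Routing is easiest to verify by pushing a synthetic alert through Alertmanager and watching which receiver fires. The sketch below builds the JSON body for Alertmanager's v2 alerts endpoint; test_alert_payload and the alert name TestAlert are hypothetical, and the URL assumes Alertmanager on its default port 9093.

```shell
#!/bin/sh
# test_alert_payload SEVERITY INSTANCE: print a one-alert JSON array
# suitable for POSTing to /api/v2/alerts.
test_alert_payload() {
  printf '[{"labels":{"alertname":"TestAlert","severity":"%s","instance":"%s"}}]\n' "$1" "$2"
}

# Usage:
#   amtool check-config /etc/alertmanager/alertmanager.yml
#   test_alert_payload critical web-01 |
#     curl -fsS -X POST -H 'Content-Type: application/json' \
#          --data @- http://localhost:9093/api/v2/alerts
```

With the routing tree above, a critical test alert should reach the pagerduty receiver, while any other severity falls through to slack-notifications. Expect a delay of about group_wait (30s) before the first notification.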

# Always link alerts to runbooks:
# annotations:
#   summary: "High disk usage on {{ $labels.instance }}"
#   runbook: "https://wiki.example.com/runbooks/disk-full"

Conclusion

The most actionable monitoring investment is the Node Exporter + Prometheus + Grafana stack with the community-maintained Node Exporter Full dashboard (Grafana dashboard ID 1860). This gives you immediate visibility into CPU, memory, disk, and network for all servers with minimal configuration. Add Alertmanager with a Slack or PagerDuty integration so critical alerts reach someone who can act on them. Alerts that only appear in a dashboard nobody watches are not monitoring.

FAQ

Is an enterprise monitoring stack important for Ubuntu administrators?

Yes. Metrics, logs, and alerts are how you learn that a server is running out of disk or memory, or has become unreachable, before users report it. That makes monitoring central to reliability and troubleshooting.

Should I practice this on a live server?

Use a lab VM first. After you understand the command output and rollback path, apply the workflow carefully on real systems.

What should I do after reading this article?

Run the practice commands, write down what each one shows, and continue to the next article in the Ubuntu roadmap.

Need help with Ubuntu administration?

Work directly with Muhammad Irfan Aslam for Ubuntu Server, Linux, cloud, Docker, DevOps, CI/CD, or infrastructure troubleshooting support.
