High Availability

Alertmanager can be configured as a cluster for high availability. This document describes how the HA mechanism works, its design goals, and operational considerations.

Design Goals

The Alertmanager HA implementation is designed around three core principles:

  1. Single pane view and management - Silences and alerts can be viewed and managed from any cluster member, providing a unified operational experience
  2. Survive cluster split-brain with "fail open" - During network partitions, Alertmanager prefers to send duplicate notifications rather than miss critical alerts
  3. At-least-once delivery - The system guarantees that notifications are delivered at least once, in line with the fail-open philosophy

These goals prioritize operational reliability and alert delivery over strict exactly-once semantics.

Architecture Overview

An Alertmanager cluster consists of multiple Alertmanager instances that communicate using a gossip protocol. Each instance:

  • Receives alerts independently from Prometheus servers
  • Participates in a peer-to-peer gossip mesh
  • Replicates state (silences and notification log) to other cluster members
  • Processes and sends notifications independently

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Prometheus 1 │    │ Prometheus 2 │    │ Prometheus N │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       │ alerts            │ alerts            │ alerts
       │                   │                   │
       ▼                   ▼                   ▼
    ┌────────────────────────────────────────────┐
    │  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
    │  │  AM-1    │  │  AM-2    │  │  AM-3    │  │
    │  │ (pos: 0) ├──┤ (pos: 1) ├──┤ (pos: 2) │  │
    │  └──────────┘  └──────────┘  └──────────┘  │
    │          Gossip Protocol (Memberlist)      │
    └────────────────────────────────────────────┘
              │           │           │
              ▼           ▼           ▼
         Receivers   Receivers   Receivers

Gossip Protocol

Alertmanager uses HashiCorp's memberlist library to implement gossip-based communication. The gossip protocol handles:

Membership Management

  • Automatic peer discovery - Instances can be configured with a list of known peers and will automatically discover other cluster members
  • Health checking - Regular probes detect failed members (default: every 1 second)
  • Failure detection - Failed members are marked and can attempt to rejoin
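
In Alertmanager these responsibilities are delegated to memberlist. As a rough illustration of how the cluster flags map onto memberlist settings (a simplified sketch, not Alertmanager's actual wiring), consider:

package main

import (
    "log"
    "time"

    "github.com/hashicorp/memberlist"
)

func main() {
    // Start from memberlist's LAN defaults and apply the cluster flag values.
    cfg := memberlist.DefaultLANConfig()
    cfg.BindAddr = "0.0.0.0"
    cfg.BindPort = 9094
    cfg.ProbeInterval = 1 * time.Second         // --cluster.probe-interval
    cfg.GossipInterval = 200 * time.Millisecond // --cluster.gossip-interval
    cfg.PushPullInterval = 60 * time.Second     // --cluster.pushpull-interval

    ml, err := memberlist.Create(cfg)
    if err != nil {
        log.Fatalf("create memberlist: %v", err)
    }

    // Join the peers given via --cluster.peer; a failed join is not fatal,
    // since missing peers can still join later.
    if _, err := ml.Join([]string{"am-2.example.com:9094", "am-3.example.com:9094"}); err != nil {
        log.Printf("failed to join cluster: %v", err)
    }
}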

State Replication

The gossip layer replicates three types of state:

  1. Silences - Create, update, and delete operations are broadcast to all peers
  2. Notification log - Records of which notifications were sent to prevent duplicates
  3. Membership changes - Join, leave, and failure events

State is eventually consistent - all cluster members will converge to the same state given sufficient time and network connectivity.

Gossip Settling

When an Alertmanager starts or rejoins the cluster, it waits for gossip to "settle" before processing notifications. This prevents sending notifications based on incomplete state.

The settling algorithm waits until:

  • The number of peers remains stable for 3 consecutive checks (default interval: push-pull interval)
  • Or a timeout occurs (configurable via context)

During this time, the instance still receives and stores alerts but defers notification processing.
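
A simplified sketch of the settling loop (constant and field names are assumptions; the real logic lives in Alertmanager's cluster package):

// Settle blocks until the peer count has been stable for three consecutive
// checks, or until the context (settle timeout) expires.
func (p *Peer) Settle(ctx context.Context, interval time.Duration) {
    const stableChecksRequired = 3
    stable, lastPeers := 0, -1
    for stable < stableChecksRequired {
        select {
        case <-ctx.Done():
            close(p.readyc) // timed out: proceed with the state we have (fail open)
            return
        case <-time.After(interval):
        }
        if n := len(p.mlist.Members()); n == lastPeers {
            stable++
        } else {
            stable, lastPeers = 0, n
        }
    }
    close(p.readyc) // unblocks WaitReady
}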

Notification Pipeline in HA Mode

The notification pipeline operates differently in a clustered environment to ensure deduplication while maintaining at-least-once delivery:

┌────────────────────────────────────────────────┐
│              DISPATCHER STAGE                  │
├────────────────────────────────────────────────┤
│ 1. Find matching route(s)                      │
│ 2. Find/create aggregation group within route  │
│ 3. Throttle by group wait or group interval    │
└───────────────────┬────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────────┐
│               NOTIFIER STAGE                   │
├────────────────────────────────────────────────┤
│ 1. Wait for HA gossip to settle                │◄─── Ensures complete state
│ 2. Filter inhibited alerts                     │
│ 3. Filter non-time-active alerts               │
│ 4. Filter time-muted alerts                    │
│ 5. Filter silenced alerts                      │◄─── Uses replicated silences
│ 6. Wait according to HA cluster peer index     │◄─── Staggered notifications
│ 7. Dedupe by repeat interval/HA state          │◄─── Uses notification log
│ 8. Notify & retry intermittent failures        │
│ 9. Update notification log                     │◄─── Replicated to peers
└────────────────────────────────────────────────┘

HA-Specific Stages

1. Gossip Settling Wait

Before the first notification from a group, the instance waits for gossip to settle. This ensures:

  • Silences are fully replicated
  • The notification log contains recent send records from other instances
  • The cluster membership is stable

Implementation: peer.WaitReady(ctx)
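
Under the same assumptions as the settling sketch above, WaitReady simply blocks until readyc is closed or the context is cancelled:

func (p *Peer) WaitReady(ctx context.Context) error {
    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-p.readyc:
        return nil
    }
}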

2. Peer Position-Based Wait

To prevent all cluster members from sending notifications simultaneously, each instance waits based on its position in the sorted peer list:

wait_time = peer_position × peer_timeout

For example, with 3 instances and a 15-second peer timeout:

  • Instance am-1 (position 0): waits 0 seconds
  • Instance am-2 (position 1): waits 15 seconds
  • Instance am-3 (position 2): waits 30 seconds

This staggered timing allows:

  • The first instance to send the notification
  • Subsequent instances to see the notification log entry
  • Deduplication to prevent duplicate sends

Implementation: clusterWait() in cmd/alertmanager/main.go:594
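
The helper amounts to multiplying the peer position by the configured peer timeout, roughly:

// clusterWait returns a function that computes this instance's stagger delay
// from its current position in the cluster.
func clusterWait(p *cluster.Peer, timeout time.Duration) func() time.Duration {
    return func() time.Duration {
        return time.Duration(p.Position()) * timeout
    }
}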

Position is determined by sorting all peer names alphabetically:

func (p *Peer) Position() int {
    all := p.mlist.Members()
    sort.Slice(all, func(i, j int) bool {
        return all[i].Name < all[j].Name
    })
    // Find position of self in the sorted list.
    k := 0
    for _, n := range all {
        if n.Name == p.Self().Name {
            break
        }
        k++
    }
    return k
}

3. Deduplication via Notification Log

The DedupStage queries the notification log to determine whether a notification should be sent (simplified sketch):

// Look up the most recent send for this receiver and group in the notification log.
entry, ok := nflog.Query(receiver, groupKey)
if ok && !shouldNotify(entry, alerts, repeatInterval) {
    // Skip: an equivalent notification was sent recently.
    return nil
}

Deduplication checks:

  • Firing alerts changed? If yes, notify
  • Resolved alerts changed? If yes and send_resolved: true, notify
  • Repeat interval elapsed? If yes, notify
  • Otherwise: Skip notification (deduplicated)

The notification log is replicated via gossip, so all cluster members share the same send history.
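
The shouldNotify helper from the snippet above can be sketched as follows, where firing and resolved are the fingerprint sets derived from the group's alerts, sendResolved mirrors the receiver's send_resolved setting, and the entry type's subset helpers are assumed:

// shouldNotify reports whether a new notification is needed given the last
// notification log entry for this receiver and group (simplified).
func shouldNotify(entry *pb.Entry, firing, resolved map[uint64]struct{},
    repeat time.Duration, sendResolved bool) bool {
    if entry == nil {
        return len(firing) > 0 // never notified before: send if anything is firing
    }
    if !entry.IsFiringSubset(firing) {
        return true // the set of firing alerts changed
    }
    if sendResolved && !entry.IsResolvedSubset(resolved) {
        return true // resolved alerts changed and the receiver wants them
    }
    // Nothing changed: notify only if the repeat interval has elapsed.
    return entry.Timestamp.Before(time.Now().Add(-repeat))
}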

Split-Brain Handling (Fail Open)

During a network partition, the cluster may split into multiple groups that cannot communicate. Alertmanager's "fail open" design ensures alerts are still delivered:

Scenario: Network Partition

Before partition:
┌────────┬────────┬────────┐
│  AM-1  │  AM-2  │  AM-3  │
└────────┴────────┴────────┘
    Unified cluster

After partition:
┌────────┐       │       ┌────────┬────────┐
│  AM-1  │       │       │  AM-2  │  AM-3  │
└────────┘       │       └────────┴────────┘
 Partition A     │        Partition B

Behavior During Partition

In Partition A (AM-1 alone):

  • AM-1 sees itself as position 0
  • Waits 0 × timeout = 0 seconds
  • Sends notifications (no dedup from AM-2/AM-3)

In Partition B (AM-2, AM-3):

  • AM-2 is position 0, AM-3 is position 1
  • AM-2 waits 0 seconds, sends notification
  • AM-3 sees AM-2's notification log entry, deduplicates

Result: Duplicate notifications sent (one from Partition A, one from Partition B)

This is intentional - Alertmanager prefers duplicate notifications over missed alerts.

After Partition Heals

When the network partition heals:

  1. Gossip protocol detects all peers again
  2. Notification logs are merged (a CRDT-style, timestamp-based merge; see the sketch below)
  3. Future notifications are deduplicated correctly across all instances
  4. Silences created in either partition are replicated to all peers
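
The notification log merge in step 2 is a last-write-wins comparison on entry timestamps, along these lines (a sketch; names assumed):

// mergeEntry keeps the newer of two notification log entries for the same
// receiver and group key.
func mergeEntry(existing, incoming *pb.Entry) *pb.Entry {
    if existing == nil || incoming.Timestamp.After(existing.Timestamp) {
        return incoming // the newer entry wins
    }
    return existing
}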

Silence Management in HA

Silences are first-class replicated state in the cluster.

Silence Creation and Updates

When a silence is created or updated on any instance:

  1. Local storage - Silence is stored in the local state map
  2. Broadcast - Silence is serialized (protobuf) and broadcast via gossip
  3. Merge on receive - Other instances receive and merge the silence:
    // Merge logic: last-write-wins based on the UpdatedAt timestamp
    if !exists || incoming.UpdatedAt.After(existing.UpdatedAt) {
        acceptUpdate()
    }
  4. Indexing - The silence matcher cache is updated for fast alert matching

Silence Expiry

Silences have:

  • StartsAt, EndsAt - The active time range
  • ExpiresAt - When to garbage collect (EndsAt + retention period)
  • UpdatedAt - For conflict resolution during merge

Each instance independently:

  • Evaluates silence state (pending/active/expired) based on the current time
  • Garbage collects expired silences past their retention period; GC is local only (not gossiped), since all instances converge to the same decision
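
A sketch of that per-instance state evaluation (illustrative names, not Alertmanager's actual code):

// silenceState classifies a silence relative to the current time.
func silenceState(startsAt, endsAt, now time.Time) string {
    switch {
    case now.Before(startsAt):
        return "pending"
    case now.Before(endsAt):
        return "active"
    default:
        return "expired"
    }
}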

Single Pane of Glass

Users can interact with any Alertmanager instance in the cluster:

  • View silences - All instances have the same silence state (eventually consistent)
  • Create/update silences - Changes made on any instance propagate to all peers
  • Delete silences - Implemented as "expire immediately" + gossip

This provides a unified operational experience regardless of which instance you access.

Operational Considerations

Configuration

Cluster settings are passed to each Alertmanager instance as command-line flags; there is no cluster section in alertmanager.yml:

# alertmanager.yml
global:
  # ... other config ...

# No cluster config in YAML - use CLI flags

Command-line flags:

alertmanager \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=am-1.example.com:9094 \
  --cluster.peer=am-2.example.com:9094 \
  --cluster.peer=am-3.example.com:9094 \
  --cluster.advertise-address=$(hostname):9094 \
  --cluster.peer-timeout=15s \
  --cluster.gossip-interval=200ms \
  --cluster.pushpull-interval=60s

Key flags:

  • --cluster.listen-address - Bind address for cluster communication (default: 0.0.0.0:9094)
  • --cluster.peer - List of peer addresses (can be repeated)
  • --cluster.advertise-address - Address advertised to peers (auto-detected if omitted)
  • --cluster.peer-timeout - Wait time per peer position for deduplication (default: 15s)
  • --cluster.gossip-interval - How often to gossip (default: 200ms)
  • --cluster.pushpull-interval - Full state sync interval (default: 60s)
  • --cluster.probe-interval - Peer health check interval (default: 1s)
  • --cluster.settle-timeout - Max time to wait for gossip settling (default: context timeout)

Prometheus Configuration

Important: Configure Prometheus to send alerts to all Alertmanager instances, not via a load balancer.

# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - am-1.example.com:9093
            - am-2.example.com:9093
            - am-3.example.com:9093

This ensures:

  • Redundancy - If one Alertmanager is down, others still receive alerts
  • Independent processing - Each instance independently evaluates routing, grouping, and deduplication
  • No single point of failure - Load balancers introduce a single point of failure

Cluster Size Considerations

Since Alertmanager uses gossip without quorum or voting, any N instances tolerate up to N-1 failures - as long as one instance is alive, notifications will be sent.

However, cluster size involves tradeoffs:

Benefits of more instances:

  • Greater resilience to simultaneous failures (hardware, network, datacenter outages)
  • Continued operation even during maintenance windows

Costs of more instances:

  • More duplicate notifications during network partitions
  • More gossip traffic

Typical deployments:

  • 2-3 instances - Common for single-datacenter production deployments
  • 4-5 instances - Multi-datacenter or highly critical environments

Note: Unlike consensus-based systems (etcd, Raft), odd vs. even cluster sizes make no difference - there is no voting or quorum.

Monitoring Cluster Health

Key metrics to monitor:

# Cluster size
alertmanager_cluster_members

# Peer health
alertmanager_cluster_peer_info

# Peer position (affects notification timing)
alertmanager_peer_position

# Failed peers
alertmanager_cluster_failed_peers

# State replication
alertmanager_nflog_gossip_messages_propagated_total
alertmanager_silences_gossip_messages_propagated_total

Security

By default, cluster communication is unencrypted. For production deployments, especially across WANs, use mutual TLS:

alertmanager \
  --cluster.tls-config=/etc/alertmanager/cluster-tls.yml

See Secure Cluster Traffic for details.

Persistence

Each Alertmanager instance persists:

  • Silences - Stored in a snapshot file (default: data/silences)
  • Notification log - Stored in a snapshot file (default: data/nflog)

On restart:

  1. Instance loads silences and notification log from disk
  2. Joins the cluster and gossips with peers
  3. Merges state received from peers (newer timestamps win)
  4. Begins processing notifications after gossip settling

Note: Alerts themselves are not persisted - Prometheus re-sends firing alerts regularly.

Common Pitfalls

  1. Load balancing Prometheus → Alertmanager

    • ❌ Don't use a load balancer
    • ✅ Configure all instances in Prometheus
  2. Not waiting for gossip to settle

    • Can lead to missed silences or duplicate notifications on startup
    • The --cluster.settle-timeout flag controls this
  3. Network ACLs blocking cluster port

    • Ensure port 9094 (or your --cluster.listen-address port) is open between all instances
    • Both TCP and UDP are used by default (TCP only if using TLS transport)
  4. Unroutable advertise addresses

    • If --cluster.advertise-address is not set, Alertmanager tries to auto-detect
    • For cloud/NAT environments, explicitly set a routable address
  5. Mismatched cluster configurations

    • All instances should have the same --cluster.peer-timeout and gossip settings
    • Mismatches can cause unnecessary duplicates or missed notifications

How It Works: End-to-End Example

Scenario: 3-instance cluster, new alert group

  1. Alert arrives at all 3 instances from Prometheus
  2. Dispatcher creates aggregation group, waits group_wait (e.g., 30s)
  3. After group_wait:
    • Each instance prepares to notify
  4. Notifier stage:
    • All instances wait for gossip to settle (if just started)
    • AM-1 (position 0): waits 0s, checks notification log (empty), sends notification, logs to nflog
    • AM-2 (position 1): waits 15s, checks notification log (sees AM-1's entry), skips notification
    • AM-3 (position 2): waits 30s, checks notification log (sees AM-1's entry), skips notification
  5. Result: Exactly one notification sent (by AM-1)

Scenario: AM-1 fails

  1. Alert arrives at AM-2 and AM-3 only
  2. Dispatcher creates group, waits group_wait
  3. Notifier stage:
    • AM-1 is not in cluster (failed probe)
    • AM-2 is now position 0: waits 0s, sends notification
    • AM-3 is now position 1: waits 15s, sees AM-2's entry, skips
  4. Result: Notification still sent (fail-open)

Scenario: Network partition during notification

  1. Alert arrives at all instances
  2. Network partition splits AM-1 from AM-2/AM-3
  3. In partition A (AM-1):
    • Position 0, waits 0s, sends notification
  4. In partition B (AM-2, AM-3):
    • AM-2 is position 0, waits 0s, sends notification
    • AM-3 is position 1, waits 15s, deduplicates
  5. Result: Two notifications sent (one per partition) - fail-open behavior

Troubleshooting

Check cluster status

# View cluster members via API
curl http://am-1:9093/api/v2/status

# Check metrics
curl http://am-1:9093/metrics | grep cluster

Diagnose split-brain

If you suspect split-brain:

  1. Check alertmanager_cluster_members on each instance
    • Should match total cluster size
  2. Check alertmanager_cluster_peer_info{state="alive"}
    • Should show all peers as alive
  3. Review network connectivity between instances

Debug duplicate notifications

Duplicate notifications can occur due to:

  1. Network partitions (expected, fail-open)
  2. Gossip not settled - Check --cluster.settle-timeout
  3. Clock skew - Ensure NTP is configured on all instances
  4. Notification log not replicating - Check gossip metrics

Enable debug logging:

alertmanager --log.level=debug

Look for:

  • "Waiting for gossip to settle..."
  • "gossip settled; proceeding"
  • Deduplication decisions in notification pipeline
