High Availability

Alertmanager can be configured as a cluster for high availability. This document describes how the HA mechanism works, its design goals, and operational considerations.

Design Goals

The Alertmanager HA implementation is designed around three core principles:

  1. Single pane view and management - Silences and alerts can be viewed and managed from any cluster member, providing a unified operational experience
  2. Survive cluster split-brain with "fail open" - During network partitions, Alertmanager prefers to send duplicate notifications rather than miss critical alerts
  3. At-least-once delivery - The system guarantees that notifications are delivered at least once, in line with the fail-open philosophy

These goals prioritize operational reliability and alert delivery over strict exactly-once semantics.

Architecture Overview

An Alertmanager cluster consists of multiple Alertmanager instances that communicate using a gossip protocol. Each instance:

  • Receives alerts independently from Prometheus servers
  • Participates in a peer-to-peer gossip mesh
  • Replicates state (silences and notification log) to other cluster members
  • Processes and sends notifications independently

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Prometheus 1 │    │ Prometheus 2 │    │ Prometheus N │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       │ alerts            │ alerts            │ alerts
       │                   │                   │
       ▼                   ▼                   ▼
    ┌────────────────────────────────────────────┐
    │  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
    │  │  AM-1    │  │  AM-2    │  │  AM-3    │  │
    │  │ (pos: 0) ├──┤ (pos: 1) ├──┤ (pos: 2) │  │
    │  └──────────┘  └──────────┘  └──────────┘  │
    │          Gossip Protocol (Memberlist)      │
    └────────────────────────────────────────────┘
              │           │           │
              ▼           ▼           ▼
         Receivers   Receivers   Receivers

Gossip Protocol

Alertmanager uses HashiCorp's memberlist library to implement gossip-based communication. The gossip protocol handles:

Membership Management

  • Automatic peer discovery - Instances can be configured with a list of known peers and will automatically discover other cluster members
  • Health checking - Regular probes detect failed members (default: every 1 second)
  • Failure detection - Failed members are marked and can attempt to rejoin
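
In Alertmanager these responsibilities are delegated to memberlist. As a rough illustration of how the cluster flags map onto memberlist settings (a simplified sketch, not Alertmanager's actual wiring), consider:

package main

import (
    "log"
    "time"

    "github.com/hashicorp/memberlist"
)

func main() {
    // Start from memberlist's LAN defaults and apply the cluster flag values.
    cfg := memberlist.DefaultLANConfig()
    cfg.BindAddr = "0.0.0.0"
    cfg.BindPort = 9094
    cfg.ProbeInterval = 1 * time.Second         // --cluster.probe-interval
    cfg.GossipInterval = 200 * time.Millisecond // --cluster.gossip-interval
    cfg.PushPullInterval = 60 * time.Second     // --cluster.pushpull-interval

    ml, err := memberlist.Create(cfg)
    if err != nil {
        log.Fatalf("create memberlist: %v", err)
    }

    // Join the peers given via --cluster.peer; a failed join is not fatal,
    // since missing peers can still join later.
    if _, err := ml.Join([]string{"am-2.example.com:9094", "am-3.example.com:9094"}); err != nil {
        log.Printf("failed to join cluster: %v", err)
    }
}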

State Replication

The gossip layer replicates three types of state:

  1. Silences - Create, update, and delete operations are broadcast to all peers
  2. Notification log - Records of which notifications were sent to prevent duplicates
  3. Membership changes - Join, leave, and failure events

State is eventually consistent - all cluster members will converge to the same state given sufficient time and network connectivity.

Gossip Settling

When an Alertmanager starts or rejoins the cluster, it waits for gossip to "settle" before processing notifications. This prevents sending notifications based on incomplete state.

The settling algorithm waits until:

  • The number of peers remains stable for 3 consecutive checks (default interval: push-pull interval)
  • Or a timeout occurs (configurable via context)

During this time, the instance still receives and stores alerts but defers notification processing.
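
A simplified sketch of the settling loop (constant and field names are assumptions; the real logic lives in Alertmanager's cluster package):

// Settle blocks until the peer count has been stable for three consecutive
// checks, or until the context (settle timeout) expires.
func (p *Peer) Settle(ctx context.Context, interval time.Duration) {
    const stableChecksRequired = 3
    stable, lastPeers := 0, -1
    for stable < stableChecksRequired {
        select {
        case <-ctx.Done():
            close(p.readyc) // timed out: proceed with the state we have (fail open)
            return
        case <-time.After(interval):
        }
        if n := len(p.mlist.Members()); n == lastPeers {
            stable++
        } else {
            stable, lastPeers = 0, n
        }
    }
    close(p.readyc) // unblocks WaitReady
}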

Notification Pipeline in HA Mode

The notification pipeline operates differently in a clustered environment to ensure deduplication while maintaining at-least-once delivery:

┌────────────────────────────────────────────────┐
│              DISPATCHER STAGE                  │
├────────────────────────────────────────────────┤
│ 1. Find matching route(s)                      │
│ 2. Find/create aggregation group within route  │
│ 3. Throttle by group wait or group interval    │
└───────────────────┬────────────────────────────┘
                    │
                    ▼
┌────────────────────────────────────────────────┐
│               NOTIFIER STAGE                   │
├────────────────────────────────────────────────┤
│ 1. Wait for HA gossip to settle                │◄─── Ensures complete state
│ 2. Filter inhibited alerts                     │
│ 3. Filter non-time-active alerts               │
│ 4. Filter time-muted alerts                    │
│ 5. Filter silenced alerts                      │◄─── Uses replicated silences
│ 6. Wait according to HA cluster peer index     │◄─── Staggered notifications
│ 7. Dedupe by repeat interval/HA state          │◄─── Uses notification log
│ 8. Notify & retry intermittent failures        │
│ 9. Update notification log                     │◄─── Replicated to peers
└────────────────────────────────────────────────┘

HA-Specific Stages

1. Gossip Settling Wait

Before the first notification from a group, the instance waits for gossip to settle. This ensures:

  • Silences are fully replicated
  • The notification log contains recent send records from other instances
  • The cluster membership is stable

Implementation: peer.WaitReady(ctx)
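
Under the same assumptions as the settling sketch above, WaitReady simply blocks until readyc is closed or the context is cancelled:

func (p *Peer) WaitReady(ctx context.Context) error {
    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-p.readyc:
        return nil
    }
}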

2. Peer Position-Based Wait

To prevent all cluster members from sending notifications simultaneously, each instance waits based on its position in the sorted peer list:

wait_time = peer_position × peer_timeout

For example, with 3 instances and a 15-second peer timeout:

  • Instance am-1 (position 0): waits 0 seconds
  • Instance am-2 (position 1): waits 15 seconds
  • Instance am-3 (position 2): waits 30 seconds

This staggered timing allows:

  • The first instance to send the notification
  • Subsequent instances to see the notification log entry
  • Deduplication to prevent duplicate sends

Implementation: clusterWait() in cmd/alertmanager/main.go:594
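
The helper amounts to multiplying the peer position by the configured peer timeout, roughly:

// clusterWait returns a function that computes this instance's stagger delay
// from its current position in the cluster.
func clusterWait(p *cluster.Peer, timeout time.Duration) func() time.Duration {
    return func() time.Duration {
        return time.Duration(p.Position()) * timeout
    }
}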

Position is determined by sorting all peer names alphabetically:

func (p *Peer) Position() int {
    all := p.mlist.Members()
    sort.Slice(all, func(i, j int) bool {
        return all[i].Name < all[j].Name
    })
    // Find position of self in the sorted list.
    k := 0
    for _, n := range all {
        if n.Name == p.Self().Name {
            break
        }
        k++
    }
    return k
}

3. Deduplication via Notification Log

The DedupStage queries the notification log to determine whether a notification should be sent (simplified sketch):

// Look up the most recent send for this receiver and group in the notification log.
entry, ok := nflog.Query(receiver, groupKey)
if ok && !shouldNotify(entry, alerts, repeatInterval) {
    // Skip: an equivalent notification was sent recently.
    return nil
}

Deduplication checks:

  • Firing alerts changed? If yes, notify
  • Resolved alerts changed? If yes and send_resolved: true, notify
  • Repeat interval elapsed? If yes, notify
  • Otherwise: Skip notification (deduplicated)

The notification log is replicated via gossip, so all cluster members share the same send history.
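
The shouldNotify helper from the snippet above can be sketched as follows, where firing and resolved are the fingerprint sets derived from the group's alerts, sendResolved mirrors the receiver's send_resolved setting, and the entry type's subset helpers are assumed:

// shouldNotify reports whether a new notification is needed given the last
// notification log entry for this receiver and group (simplified).
func shouldNotify(entry *pb.Entry, firing, resolved map[uint64]struct{},
    repeat time.Duration, sendResolved bool) bool {
    if entry == nil {
        return len(firing) > 0 // never notified before: send if anything is firing
    }
    if !entry.IsFiringSubset(firing) {
        return true // the set of firing alerts changed
    }
    if sendResolved && !entry.IsResolvedSubset(resolved) {
        return true // resolved alerts changed and the receiver wants them
    }
    // Nothing changed: notify only if the repeat interval has elapsed.
    return entry.Timestamp.Before(time.Now().Add(-repeat))
}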

Split-Brain Handling (Fail Open)

During a network partition, the cluster may split into multiple groups that cannot communicate. Alertmanager's "fail open" design ensures alerts are still delivered:

Scenario: Network Partition

Before partition:
┌────────┬────────┬────────┐
│  AM-1  │  AM-2  │  AM-3  │
└────────┴────────┴────────┘
    Unified cluster

After partition:
┌────────┐       │       ┌────────┬────────┐
│  AM-1  │       │       │  AM-2  │  AM-3  │
└────────┘       │       └────────┴────────┘
 Partition A     │        Partition B

Behavior During Partition

In Partition A (AM-1 alone):

  • AM-1 sees itself as position 0
  • Waits 0 × timeout = 0 seconds
  • Sends notifications (no dedup from AM-2/AM-3)

In Partition B (AM-2, AM-3):

  • AM-2 is position 0, AM-3 is position 1
  • AM-2 waits 0 seconds, sends notification
  • AM-3 sees AM-2's notification log entry, deduplicates

Result: Duplicate notifications sent (one from Partition A, one from Partition B)

This is intentional - Alertmanager prefers duplicate notifications over missed alerts.

After Partition Heals

When the network partition heals:

  1. Gossip protocol detects all peers again
  2. Notification logs are merged (a CRDT-style, timestamp-based merge; see the sketch below)
  3. Future notifications are deduplicated correctly across all instances
  4. Silences created in either partition are replicated to all peers
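
The notification log merge in step 2 is a last-write-wins comparison on entry timestamps, along these lines (a sketch; names assumed):

// mergeEntry keeps the newer of two notification log entries for the same
// receiver and group key.
func mergeEntry(existing, incoming *pb.Entry) *pb.Entry {
    if existing == nil || incoming.Timestamp.After(existing.Timestamp) {
        return incoming // the newer entry wins
    }
    return existing
}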

Silence Management in HA

Silences are first-class replicated state in the cluster.

Silence Creation and Updates

When a silence is created or updated on any instance:

  1. Local storage - Silence is stored in the local state map
  2. Broadcast - Silence is serialized (protobuf) and broadcast via gossip
  3. Merge on receive - Other instances receive and merge the silence:
    // Merge logic: last-write-wins based on the UpdatedAt timestamp
    if !exists || incoming.UpdatedAt.After(existing.UpdatedAt) {
        acceptUpdate()
    }
  4. Indexing - The silence matcher cache is updated for fast alert matching

Silence Expiry

Silences have:

  • StartsAt, EndsAt - The active time range
  • ExpiresAt - When to garbage collect (EndsAt + retention period)
  • UpdatedAt - For conflict resolution during merge

Each instance independently:

  • Evaluates silence state (pending/active/expired) based on the current time
  • Garbage collects expired silences past their retention period; GC is local only (not gossiped), since all instances converge to the same decision
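
A sketch of that per-instance state evaluation (illustrative names, not Alertmanager's actual code):

// silenceState classifies a silence relative to the current time.
func silenceState(startsAt, endsAt, now time.Time) string {
    switch {
    case now.Before(startsAt):
        return "pending"
    case now.Before(endsAt):
        return "active"
    default:
        return "expired"
    }
}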

Single Pane of Glass

Users can interact with any Alertmanager instance in the cluster:

  • View silences - All instances have the same silence state (eventually consistent)
  • Create/update silences - Changes made on any instance propagate to all peers
  • Delete silences - Implemented as "expire immediately" + gossip

This provides a unified operational experience regardless of which instance you access.

Operational Considerations

Configuration

Cluster settings are passed to each Alertmanager instance as command-line flags; there is no cluster section in alertmanager.yml:

# alertmanager.yml
global:
  # ... other config ...

# No cluster config in YAML - use CLI flags

Command-line flags:

alertmanager \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=am-1.example.com:9094 \
  --cluster.peer=am-2.example.com:9094 \
  --cluster.peer=am-3.example.com:9094 \
  --cluster.advertise-address=$(hostname):9094 \
  --cluster.peer-timeout=15s \
  --cluster.gossip-interval=200ms \
  --cluster.pushpull-interval=60s

Key flags:

  • --cluster.listen-address - Bind address for cluster communication (default: 0.0.0.0:9094)
  • --cluster.peer - List of peer addresses (can be repeated)
  • --cluster.advertise-address - Address advertised to peers (auto-detected if omitted)
  • --cluster.peer-timeout - Wait time per peer position for deduplication (default: 15s)
  • --cluster.gossip-interval - How often to gossip (default: 200ms)
  • --cluster.pushpull-interval - Full state sync interval (default: 60s)
  • --cluster.probe-interval - Peer health check interval (default: 1s)
  • --cluster.settle-timeout - Max time to wait for gossip settling (default: context timeout)

Prometheus Configuration

Important: Configure Prometheus to send alerts to all Alertmanager instances, not via a load balancer.

# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - am-1.example.com:9093
            - am-2.example.com:9093
            - am-3.example.com:9093

This ensures:

  • Redundancy - If one Alertmanager is down, others still receive alerts
  • Independent processing - Each instance independently evaluates routing, grouping, and deduplication
  • No single point of failure - Load balancers introduce a single point of failure

Cluster Size Considerations

Since Alertmanager uses gossip without quorum or voting, any N instances tolerate up to N-1 failures - as long as one instance is alive, notifications will be sent.

However, cluster size involves tradeoffs:

Benefits of more instances:

  • Greater resilience to simultaneous failures (hardware, network, datacenter outages)
  • Continued operation even during maintenance windows

Costs of more instances:

  • More duplicate notifications during network partitions
  • More gossip traffic

Typical deployments:

  • 2-3 instances - Common for single-datacenter production deployments
  • 4-5 instances - Multi-datacenter or highly critical environments

Note: Unlike consensus-based systems (etcd, Raft), odd vs. even cluster sizes make no difference - there is no voting or quorum.

Monitoring Cluster Health

Key metrics to monitor:

# Cluster size
alertmanager_cluster_members

# Peer health
alertmanager_cluster_peer_info

# Peer position (affects notification timing)
alertmanager_peer_position

# Failed peers
alertmanager_cluster_failed_peers

# State replication
alertmanager_nflog_gossip_messages_propagated_total
alertmanager_silences_gossip_messages_propagated_total

Security

By default, cluster communication is unencrypted. For production deployments, especially across WANs, use mutual TLS:

alertmanager \
  --cluster.tls-config=/etc/alertmanager/cluster-tls.yml

See Secure Cluster Traffic for details.

Persistence

Each Alertmanager instance persists:

  • Silences - Stored in a snapshot file (default: data/silences)
  • Notification log - Stored in a snapshot file (default: data/nflog)

On restart:

  1. Instance loads silences and notification log from disk
  2. Joins the cluster and gossips with peers
  3. Merges state received from peers (newer timestamps win)
  4. Begins processing notifications after gossip settling

Note: Alerts themselves are not persisted - Prometheus re-sends firing alerts regularly.

Common Pitfalls

  1. Load balancing Prometheus → Alertmanager

    • ❌ Don't use a load balancer
    • ✅ Configure all instances in Prometheus
  2. Not waiting for gossip to settle

    • Can lead to missed silences or duplicate notifications on startup
    • The --cluster.settle-timeout flag controls this
  3. Network ACLs blocking cluster port

    • Ensure port 9094 (or your --cluster.listen-address port) is open between all instances
    • Both TCP and UDP are used by default (TCP only if using TLS transport)
  4. Unroutable advertise addresses

    • If --cluster.advertise-address is not set, Alertmanager tries to auto-detect
    • For cloud/NAT environments, explicitly set a routable address
  5. Mismatched cluster configurations

    • All instances should have the same --cluster.peer-timeout and gossip settings
    • Mismatches can cause unnecessary duplicates or missed notifications

How It Works: End-to-End Example

Scenario: 3-instance cluster, new alert group

  1. Alert arrives at all 3 instances from Prometheus
  2. Dispatcher creates aggregation group, waits group_wait (e.g., 30s)
  3. After group_wait:
    • Each instance prepares to notify
  4. Notifier stage:
    • All instances wait for gossip to settle (if just started)
    • AM-1 (position 0): waits 0s, checks notification log (empty), sends notification, logs to nflog
    • AM-2 (position 1): waits 15s, checks notification log (sees AM-1's entry), skips notification
    • AM-3 (position 2): waits 30s, checks notification log (sees AM-1's entry), skips notification
  5. Result: Exactly one notification sent (by AM-1)

Scenario: AM-1 fails

  1. Alert arrives at AM-2 and AM-3 only
  2. Dispatcher creates group, waits group_wait
  3. Notifier stage:
    • AM-1 is not in cluster (failed probe)
    • AM-2 is now position 0: waits 0s, sends notification
    • AM-3 is now position 1: waits 15s, sees AM-2's entry, skips
  4. Result: Notification still sent (fail-open)

Scenario: Network partition during notification

  1. Alert arrives at all instances
  2. Network partition splits AM-1 from AM-2/AM-3
  3. In partition A (AM-1):
    • Position 0, waits 0s, sends notification
  4. In partition B (AM-2, AM-3):
    • AM-2 is position 0, waits 0s, sends notification
    • AM-3 is position 1, waits 15s, deduplicates
  5. Result: Two notifications sent (one per partition) - fail-open behavior

Troubleshooting

Check cluster status

# View cluster members via API
curl http://am-1:9093/api/v2/status

# Check metrics
curl http://am-1:9093/metrics | grep cluster

Diagnose split-brain

If you suspect split-brain:

  1. Check alertmanager_cluster_members on each instance
    • Should match total cluster size
  2. Check alertmanager_cluster_peer_info{state="alive"}
    • Should show all peers as alive
  3. Review network connectivity between instances

Debug duplicate notifications

Duplicate notifications can occur due to:

  1. Network partitions (expected, fail-open)
  2. Gossip not settled - Check --cluster.settle-timeout
  3. Clock skew - Ensure NTP is configured on all instances
  4. Notification log not replicating - Check gossip metrics

Enable debug logging:

alertmanager --log.level=debug

Look for:

  • "Waiting for gossip to settle..."
  • "gossip settled; proceeding"
  • Deduplication decisions in notification pipeline
