
Proposal: Alert State Analytics for Alertmanager

OGKevin opened this issue 2 weeks ago • 14 comments

Summary

Add alert state analytics capabilities to Alertmanager to track state transitions of alerts over time. This will provide visibility into how alerts move between unprocessed, active, and suppressed states, including tracking which alerts are inhibited and by which other alerts.

Table of Contents

GH issues don't support ToC 😔

Motivation

The primary motivation for alert state analytics comes from the need to validate proposed enhancements to Alertmanager's clustering behavior, specifically #4315 which proposes making inhibitions part of the gossip protocol.

Currently, when investigating issues like:

  • Failed inhibitions during instance restarts (#4064)
  • Ready endpoint reporting ready before gossip settles (#3026)
  • Duplicate alert notifications in clustered deployments

...we lack the data to:

  1. Quantify the impact - How often do inhibition failures occur in production?
  2. Validate solutions - Would making inhibitions part of gossip actually solve the problem?
  3. Measure improvements - Can we prove that a change reduced the frequency of issues?
  4. Debug production issues - What state transitions led to unexpected behavior?

Without analytics, we're making architectural decisions based on theory rather than data.

Real-World Impact

At Cloudflare, we use inhibitions heavily in our alerting infrastructure. We've had numerous cases of users reporting that they were alerted when the alert should have been inhibited. Without analytical data, it's extremely difficult to:

  • Determine if this was actually an inhibition failure or a misconfiguration
  • Identify patterns in when inhibition failures occur
  • Correlate failures with specific cluster events or topology changes
  • Provide evidence-based answers to users about what happened

Having this analytical data would allow us to accurately debug whether inhibition failures are occurring and why, distinguishing between misconfigurations and actual bugs in the system.

Goals

  1. Track all alert state transitions including:

    • unprocessed → active
    • active → suppressed (by silence or inhibition)
    • suppressed → active
    • Any state → resolved (when alert's EndsAt timestamp is in the past)
    • resolved → deleted (when garbage collector removes it and marker.Delete() is called)
    • State changes during cluster topology changes
  2. Capture suppression relationships:

    • Which alerts were suppressed (by silence or inhibition)
    • What caused the suppression (silence ID or inhibiting alert fingerprint)
    • When the suppression was established and released
  3. Provide an interface to expose the data:

    • For database integration: REST API endpoints for querying state history
    • For event-based systems: Publish events to external message bus/queue (e.g., Kafka, Redis)
    • Enable retrieval of state history for specific alerts and time ranges
  4. Minimize performance impact:

    • Asynchronous writes to not block alert processing
    • Efficient storage to handle high-cardinality alert environments
    • Optional feature (can be disabled if not needed)

Non-Goals

  • Real-time alerting or dashboarding (analytics is for post-hoc analysis)
  • Long-term storage (retention should be configurable and limited)
  • Complex query DSL (simple API endpoints are sufficient)
  • Replication across cluster members - each instance operates independently; consumers are responsible for merging/aggregating data from multiple instances if needed

Proposed Solutions

Option 1: Direct Database Integration with State-Aware Marker

Architecture:

  • Wrap the existing MemMarker with a StateAwareMarker that records state changes to a database
  • Use an embedded analytical database (e.g., DuckDB or SQLite)
  • Employ high-performance bulk insert APIs for minimal overhead
  • Add REST API endpoints to query the analytics data

Key Components:

1. State-Aware Marker

// StateAppender records alert state changes to storage
type StateAppender interface {
    Append(fingerprint model.Fingerprint, state AlertState)
    AppendSuppressed(fingerprint model.Fingerprint, state AlertState, suppressedBy []string)
    Flush() error
    Close() error
}

// StateAwareMarker decorates the existing marker with state tracking
type StateAwareMarker interface {
    AlertMarker
    GroupMarker
    Flush() error
}
  • Decorates the existing MemMarker implementation
  • Intercepts calls to SetActiveOrSilenced(), SetInhibited(), and Delete()
  • Appends state changes to the database asynchronously via StateAppender
  • Maintains backward compatibility with existing code
  • Tracks both when alerts become resolved (EndsAt in past) and when they are deleted from the marker (GC cleanup)
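A minimal sketch of such a decorator, building on the StateAppender interface above; the marker interface here is deliberately reduced and the method signatures are illustrative, since the real types.AlertMarker in Alertmanager has more methods and slightly different signatures (imports omitted as in the snippets above):

// AlertState mirrors the states tracked by this proposal.
type AlertState string

const (
    StateActive     AlertState = "active"
    StateSuppressed AlertState = "suppressed"
    StateDeleted    AlertState = "deleted"
)

// alertMarker is a reduced stand-in for the existing marker implementation.
type alertMarker interface {
    SetInhibited(fp model.Fingerprint, inhibitedBy ...string)
    Delete(fp model.Fingerprint)
}

// stateAwareMarker forwards every call to the wrapped marker and records the
// resulting state change via the StateAppender, without changing behaviour.
type stateAwareMarker struct {
    alertMarker               // the wrapped MemMarker
    appender    StateAppender // asynchronous sink (see the buffered appender sketch further down)
}

func (m *stateAwareMarker) SetInhibited(fp model.Fingerprint, inhibitedBy ...string) {
    m.alertMarker.SetInhibited(fp, inhibitedBy...)
    if len(inhibitedBy) > 0 {
        m.appender.AppendSuppressed(fp, StateSuppressed, inhibitedBy)
        return
    }
    m.appender.Append(fp, StateActive)
}

func (m *stateAwareMarker) Delete(fp model.Fingerprint) {
    m.alertMarker.Delete(fp)
    m.appender.Append(fp, StateDeleted)
}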

2. Analytics Subscriber

// Writer defines the interface for writing alert data to storage
type Writer interface {
    InsertAlert(ctx context.Context, alert *Alert) error
}

// Subscriber subscribes to alert updates and persists them
type Subscriber interface {
    Run(ctx context.Context)
}
  • Subscribes to the alert provider's alert stream
  • Writes alert metadata (labels, annotations) to the database
  • Runs in a separate goroutine to avoid blocking alert processing
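A sketch of what Run could look like, assuming the alert stream is exposed as a plain channel (the actual provider API uses an iterator, so this only shows the shape):

// subscriber persists alert metadata as it arrives; it never blocks or fails
// the main alert processing path.
type subscriber struct {
    alerts <-chan *Alert // hypothetical stream from the alert provider
    writer Writer        // the Writer interface above
    logger *slog.Logger
}

func (s *subscriber) Run(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            return
        case a, ok := <-s.alerts:
            if !ok {
                return
            }
            if err := s.writer.InsertAlert(ctx, a); err != nil {
                // analytics failures are logged, never propagated
                s.logger.Warn("failed to persist alert for analytics", "err", err)
            }
        }
    }
}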

3. Database Storage

Schema:

-- Alerts table
CREATE TABLE alerts (
    id UUID PRIMARY KEY,
    fingerprint VARCHAR NOT NULL UNIQUE,
    alertname VARCHAR NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- State changes table
CREATE TABLE alert_states (
    id UUID PRIMARY KEY,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    alert_fingerprint VARCHAR NOT NULL,
    state VARCHAR NOT NULL,  -- 'unprocessed', 'active', 'suppressed', 'resolved', 'deleted'
    suppressed_by VARCHAR,  -- Fingerprint of inhibiting alert or silence ID (only for suppressed state)
    suppressed_reason VARCHAR,  -- 'silence' or 'inhibition' (only for suppressed state)
    FOREIGN KEY (alert_fingerprint) REFERENCES alerts(fingerprint)
);

-- Labels and annotations (normalized)
CREATE TABLE labels (...);
CREATE TABLE annotations (...);
  • Uses deterministic UUIDs (UUIDv5) to avoid duplicate inserts
  • Maintains in-memory maps to skip already-seen fingerprints
  • Transactions ensure consistency
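A sketch of the deterministic-ID and dedup ideas, assuming github.com/google/uuid (uuid.NewSHA1 produces a UUIDv5); the namespace UUID is an arbitrary fixed value:

// alertID derives a deterministic UUIDv5 from the fingerprint, so re-inserting
// the same alert always yields the same primary key and the INSERT can be made
// idempotent (e.g. ON CONFLICT DO NOTHING).
var analyticsNamespace = uuid.MustParse("6ba7b810-9dad-11d1-80b4-00c04fd430c8")

func alertID(fp model.Fingerprint) uuid.UUID {
    return uuid.NewSHA1(analyticsNamespace, []byte(fp.String()))
}

// seenAlerts is the in-memory map that skips already-written fingerprints
// before ever touching the database.
type seenAlerts struct {
    mu   sync.Mutex
    seen map[model.Fingerprint]struct{}
}

func (s *seenAlerts) firstTime(fp model.Fingerprint) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, ok := s.seen[fp]; ok {
        return false
    }
    s.seen[fp] = struct{}{}
    return true
}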

4. Storage Interface

// Database defines the interface for analytics storage
type Database interface {
    Reader
    Writer
}

type Reader interface {
    GetAlertStatesByFingerprint(ctx context.Context, fingerprint model.Fingerprint) ([]*AlertState, error)
    GetAllAlertsAndTheirStates(ctx context.Context) ([]*Alert, error)
}

5. REST API Endpoints

New endpoints:

  • GET /api/v2/alerts/states - Get all alerts with their recent states
  • GET /api/v2/alerts/{fingerprint}/states - Get state history for a specific alert
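The real v2 API is generated from an OpenAPI spec, so the snippet below is only a shape sketch of wiring the Reader interface to the per-alert endpoint (Go 1.22 net/http path values assumed):

func statesHandler(r Reader) http.HandlerFunc {
    return func(w http.ResponseWriter, req *http.Request) {
        // registered as: mux.Handle("GET /api/v2/alerts/{fingerprint}/states", statesHandler(db))
        fp, err := model.FingerprintFromString(req.PathValue("fingerprint"))
        if err != nil {
            http.Error(w, "invalid fingerprint", http.StatusBadRequest)
            return
        }
        states, err := r.GetAlertStatesByFingerprint(req.Context(), fp)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        w.Header().Set("Content-Type", "application/json")
        _ = json.NewEncoder(w).Encode(states)
    }
}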

Advantages:

  • Minimal code changes to core alert processing logic
  • High performance (bulk insert APIs can handle millions of rows/sec)
  • Embedded database (no external dependencies)
  • SQL queries for flexible analysis
  • Relatively straightforward implementation

Disadvantages:

  • Tight coupling between marker and database
  • Requires embedded database dependency
  • Database file management (rotation, cleanup)
  • Potential for write amplification in high-cardinality environments

Performance Considerations:

  • Bulk insert APIs provide extremely fast writes
  • In-memory maps reduce duplicate writes by ~90%
  • Async writes don't block alert processing
  • Configurable retention (default: 7 days recommended)
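As an illustration of the asynchronous write path, a buffered StateAppender could look roughly like this; batch size and flush interval are arbitrary, the flush function would wrap the database's bulk insert API, and the remaining StateAppender methods follow the same pattern:

type stateRecord struct {
    fp           model.Fingerprint
    state        AlertState
    suppressedBy []string
}

// bufferedAppender decouples alert processing from storage: Append only pushes
// into a channel, and a background goroutine flushes batches to the database.
type bufferedAppender struct {
    ch    chan stateRecord
    flush func(batch []stateRecord) error // e.g. a bulk INSERT
}

func (a *bufferedAppender) Append(fp model.Fingerprint, state AlertState) {
    select {
    case a.ch <- stateRecord{fp: fp, state: state}:
    default:
        // drop (and count) rather than block alert processing when the buffer is full
    }
}

func (a *bufferedAppender) run(ctx context.Context) {
    ticker := time.NewTicker(time.Second)
    defer ticker.Stop()
    batch := make([]stateRecord, 0, 1024)
    for {
        select {
        case <-ctx.Done():
            _ = a.flush(batch)
            return
        case rec := <-a.ch:
            if batch = append(batch, rec); len(batch) >= 1000 {
                _ = a.flush(batch)
                batch = batch[:0]
            }
        case <-ticker.C:
            if len(batch) > 0 {
                _ = a.flush(batch)
                batch = batch[:0]
            }
        }
    }
}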

Option 2: Event-Based Architecture

Architecture:

  • Introduce an event system for alert lifecycle events
  • Emit events for state changes without modifying the marker
  • Publish events to external message bus/queue systems (e.g., Kafka, Redis, RabbitMQ)
  • No built-in storage or REST API - consumers handle data persistence and querying

Key Components:

1. Event System

type AlertEventMetadata struct {
    Alertname        string
    Labels           model.LabelSet
    Annotations      model.LabelSet
    SuppressedBy     []string  // Silence IDs or inhibiting alert fingerprints
    SuppressedReason string    // 'silence' or 'inhibition'
}

type AlertEvent struct {
    Timestamp   time.Time
    Fingerprint model.Fingerprint
    EventType   EventType  // StateChanged, Suppressed, Unsuppressed, Resolved, Deleted
    OldState    AlertState
    NewState    AlertState
    Metadata    AlertEventMetadata
}

type EventHandler interface {
    HandleEvent(ctx context.Context, event AlertEvent) error
}

type EventBus interface {
    Subscribe(handler EventHandler)
    Publish(ctx context.Context, event AlertEvent) error
}

2. Event Emission Points

  • Modify MemMarker.SetActiveOrSilenced() to emit events (for active/silenced transitions)
  • Modify MemMarker.SetInhibited() to emit events (for inhibition transitions)
  • Modify MemMarker.Delete() to emit events (for alert deletion)
  • Hook into alert resolution detection (when EndsAt timestamp passes)
  • Emit events with full context including suppression details in metadata
  • Events include timestamps for ordering; consumers can use timestamp or UUIDv7 to handle out-of-order delivery
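For illustration, emitting a suppression event from one of these points could look roughly like this; EventSuppressed is a hypothetical EventType value, StateSuppressed comes from the earlier sketch, and the surrounding wiring is omitted:

func emitSuppressed(ctx context.Context, bus EventBus, fp model.Fingerprint, oldState AlertState, meta AlertEventMetadata) {
    evt := AlertEvent{
        Timestamp:   time.Now().UTC(),
        Fingerprint: fp,
        EventType:   EventSuppressed, // hypothetical constant
        OldState:    oldState,
        NewState:    StateSuppressed,
        Metadata:    meta,
    }
    // Publishing must never block or fail alert processing; buffering and
    // retries are the bus implementation's responsibility.
    if err := bus.Publish(ctx, evt); err != nil {
        // log and move on
    }
}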

3. Event Publisher

type EventPublisher interface {
    EventHandler
}

// Implementation would publish events to external message bus (Kafka, Redis, etc.)
// Examples: KafkaPublisher, RedisPublisher, RabbitMQPublisher
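A sketch of what a KafkaPublisher could look like, assuming the github.com/segmentio/kafka-go client (any Kafka library would do); keying by fingerprint keeps one alert's events ordered within a partition:

type kafkaPublisher struct {
    writer *kafka.Writer
}

func newKafkaPublisher(brokers []string, topic string) *kafkaPublisher {
    return &kafkaPublisher{writer: &kafka.Writer{
        Addr:     kafka.TCP(brokers...),
        Topic:    topic,
        Balancer: &kafka.Hash{}, // partition by message key (the fingerprint)
    }}
}

// HandleEvent satisfies the EventHandler interface above.
func (p *kafkaPublisher) HandleEvent(ctx context.Context, event AlertEvent) error {
    payload, err := json.Marshal(event)
    if err != nil {
        return err
    }
    return p.writer.WriteMessages(ctx, kafka.Message{
        Key:   []byte(event.Fingerprint.String()),
        Value: payload,
    })
}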

Advantages:

  • Loose coupling - analytics doesn't affect core logic
  • Extensible - easy to add new event handlers
  • Could be used for other features (webhooks, audit logs)
  • Easier to disable or configure
  • Offloads storage and querying to external systems
  • Can integrate with existing event processing infrastructure

Disadvantages:

  • More invasive changes to MemMarker
  • Event bus adds complexity
  • Potential for event loss if handlers are slow
  • Need to implement event buffering/retries
  • Requires external infrastructure (message bus)
  • No built-in querying capability - consumers must implement their own storage/queries
  • More operational overhead

Configuration

Option 1: Database Integration Configuration

# alertmanager.yml
analytics:
  enabled: true
  type: database
  storage:
    path: /data/analytics.db
    retention: 168h  # 7 days
  # Optional: limit database size
  max_size_mb: 1024
  # Optional: sample rate (1.0 = 100%, 0.1 = 10%)
  sample_rate: 1.0

Command-line flags:

--analytics.enabled
--analytics.type=database
--analytics.storage.path=/data/analytics.db
--analytics.retention=168h

Option 2: Event Publisher Configuration

# alertmanager.yml
analytics:
  enabled: true
  type: event_publisher
  publisher:
    type: kafka  # or redis, rabbitmq
    brokers:
      - kafka1:9092
      - kafka2:9092
    topic: alertmanager-state-events
    # Optional: sample rate
    sample_rate: 1.0

Command-line flags:

--analytics.enabled
--analytics.type=event_publisher
--analytics.publisher.type=kafka
--analytics.publisher.brokers=kafka1:9092,kafka2:9092
--analytics.publisher.topic=alertmanager-state-events

API Examples (Option 1 Only)

Get all alerts with recent state changes

GET /api/v2/alerts/states

Response:
[
  {
    "fingerprint": "abc123",
    "alertname": "HighCPU",
    "labels": {...},
    "annotations": {...},
    "states": [
      {
        "id": "uuid",
        "timestamp": "2025-11-13T14:30:00Z",
        "state": "active"
      },
      {
        "id": "uuid",
        "timestamp": "2025-11-13T14:35:00Z",
        "state": "suppressed",
        "suppressed_by": "def456",
        "suppressed_reason": "inhibited"
      }
    ]
  }
]

Get state history for a specific alert

GET /api/v2/alerts/{fingerprint}/states

Response:
{
  "fingerprint": "abc123",
  "states": [
    {
      "id": "uuid",
      "timestamp": "2025-11-13T14:00:00Z",
      "state": "active"
    },
    {
      "id": "uuid",
      "timestamp": "2025-11-13T14:30:00Z",
      "state": "suppressed",
      "suppressed_by": "def456",
      "suppressed_reason": "inhibited"
    },
    {
      "id": "uuid",
      "timestamp": "2025-11-13T14:45:00Z",
      "state": "active"
    }
  ]
}

Event Examples (Option 2 Only)

Alert State Change Event

{
  "timestamp": "2025-11-13T14:30:00Z",
  "fingerprint": "abc123",
  "event_type": "state_changed",
  "old_state": "active",
  "new_state": "suppressed",
  "metadata": {
    "alertname": "HighCPU",
    "labels": {...},
    "suppressed_by": "def456",
    "suppressed_reason": "inhibition"
  }
}

Alert Deletion Event

{
  "timestamp": "2025-11-13T15:00:00Z",
  "fingerprint": "abc123",
  "event_type": "deleted",
  "old_state": "resolved",
  "new_state": "deleted",
  "metadata": {
    "alertname": "HighCPU",
    "labels": {...}
  }
}

Open Questions

  1. Retention (Option 1 only): What's the right default retention period?

    • Proposal: 7 days (168 hours)
    • Rationale: Sufficient for post-mortem analysis, limited disk usage
    • Configurable for different use cases
  2. Schema Evolution: How do we handle schema changes?

    • Option 1: Version the schema in the database, provide migration path
    • Option 2: Version events, consumers handle different event versions
    • Consider forward/backward compatibility in both cases

References

OGKevin avatar Nov 13 '25 15:11 OGKevin

A third option could be something similar to nflog, or making nflog more generic so it can be used for things other than notifications as well. This means we wouldn't depend on the data for analytics alone; replication would also allow peers to see all decisions made by others.

For example, a new peer joining the cluster can immediately have a view of all active/inhibited/silenced alerts. This information can then be used by a peer to bypass heavy calculations if the peer ahead of it just made a decision to suppress an alert (think of it as optionally replaying the log).

APIs can be implemented on top to query this "log".

Related discussion on extending nflog https://github.com/prometheus/alertmanager/pull/4682#issuecomment-3493136278

siavashs avatar Nov 13 '25 15:11 siavashs

An addition for the use case:

Having a proper way to analyze alerts enables teams to dig into their alerting volume, analyze it, and take action (read: prioritization) towards healthier on-call habits. Right now, receivers (e.g. PagerDuty) need to support this type of analytics to do it.

An addition for Option 2: Event-Based Architecture

It may not need to support an (external) event bus / queuing system. Maybe writing an append-only file to disk, like Redis does (AOF - https://redis.io/docs/latest/operate/oss_and_stack/management/persistence/), is enough. This way, no external libs need to be integrated and the operator can take care of reading and shipping the file elsewhere.

Or supporting tooling to read and consume this file could be built as its own tool (next to amtool), as sketched below.
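A minimal sketch of what such an append-only writer could look like, reusing the AlertEvent/EventHandler types from Option 2; one JSON object per line (NDJSON), with rotation and fsync policy left to the operator:

// aofWriter appends one JSON-encoded event per line to a local file,
// similar in spirit to Redis' AOF.
type aofWriter struct {
    mu  sync.Mutex
    enc *json.Encoder
}

func newAOFWriter(path string) (*aofWriter, error) {
    f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o640)
    if err != nil {
        return nil, err
    }
    return &aofWriter{enc: json.NewEncoder(f)}, nil
}

func (w *aofWriter) HandleEvent(_ context.Context, event AlertEvent) error {
    w.mu.Lock()
    defer w.mu.Unlock()
    return w.enc.Encode(event) // Encode writes a trailing newline
}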

andygrunwald avatar Nov 13 '25 16:11 andygrunwald

You could also resume work on https://github.com/prometheus/alertmanager/pull/3673 as a third option. The PR mentions that tracing is already available in Prometheus (although I don't know how extensive), but I think OTEL tracing would fit much better into the ecosystem.

grobinson-grafana avatar Nov 13 '25 17:11 grobinson-grafana

Tracing has the advantage of sampling, because full analytics will be expensive as it happens per alert. The sample rate can be adjusted, and even a few samples can still be good enough to capture issues.

siavashs avatar Nov 13 '25 19:11 siavashs

I also like the idea of incorporating it as OTel telemetry; perhaps we emit the state changes as log events. That way the need for an event bus disappears, which reduces implementation complexity, and the querying capabilities can be delivered via the telemetry backend.

Ideally, for Alertmanager we could set the OTLP endpoint to send data to. In the case where a user wants to add the events to a messaging system such as Kafka, the OTel Collector can be used with the corresponding messaging system exporter.

thompson-tomo avatar Nov 14 '25 05:11 thompson-tomo

Tracing has the advantage of sampling, because full analytics will be expensive as it happens per alert. The sample rate can be adjusted, and even a few samples can still be good enough to capture issues.

I would argue that sampling alone is not a solid argument to not pick option 1 or 2, as the concept of sampling can be added there as well 🤔 .

I also like the idea of incorporating it as otel telemetry, perhaps we emit the state changes as log events.

This sounds interesting indeed. I personally don't have the knowledge needed to decide if this is solid or not. Would need to spend some time digging into OTel to see if it is "as simple" as this.

but I think OTEL tracing would fit much better into the ecosystem.

Tracing alone I would say might not be it. I would argue that with tracing alone, it would be impossible, or at least too hard, to answer analytics-based questions like:

Having a proper way to analyze alerts enables teams to dig into their alerting volume, analyze it, and take action (read: prioritization) towards healthier on-call habits. Right now, receivers (e.g. PagerDuty) need to support this type of analytics to do it.

OGKevin avatar Nov 14 '25 08:11 OGKevin


Perhaps @ArthurSens can be of assistance on the OTel interoperability topic as well as the OTel Collector. Note the collector could even transform the events into metrics visible in Prometheus.

thompson-tomo avatar Nov 14 '25 09:11 thompson-tomo

For integrations like PagerDuty we could also use an external proxy which sits between Alertmanager and the "internet"; it will "log" or "trace" all notifications and can enrich data if required. This can then emit any form of observability data desired.

Integration would be easy through a "global proxy" configuration in Alertmanager.

We could also have a similar component in front of Alertmanager which analyses all the alerts. So basically we analyse input and output and emit the merged data.

graph TD;
    Prometheus-->Input_Analyser-->Alertmanager-->Output_Analyser-->Integration(PagerDuty);
    Input_Analyser-->Data_Merger/Enricher;
    Output_Analyser-->Data_Merger/Enricher;

This has 2 benefits:

  1. zero performance cost for Alertmanager
  2. optional component(s) for analytics

siavashs avatar Nov 14 '25 10:11 siavashs

I like your thinking @siavashs and in fact I have proposed some small extensions to Otlp/OpenTelemetry to help facilitate this use case and making it a first class experience https://github.com/open-telemetry/opentelemetry-specification/issues/4729

The process diagram is almost as I imagined it being when I was writing that proposal. In my case the proxy would be the otel collector.

thompson-tomo avatar Nov 14 '25 11:11 thompson-tomo

For integrations like PagerDuty we could also use an external proxy which sits between Alertmanager and the "internet"; it will "log" or "trace" all notifications and can enrich data if required. This can then emit any form of observability data desired.

Integration would be easy through a "global proxy" configuration in Alertmanager.

We tend to do this as well. If the wire protocols (e.g. HTTP contracts) are the same, it is not hard to achieve, I would say.

Having a proper way to analyze alerts enables teams to dig into their alerting volume, analyze it, and take action (read: prioritization) towards healthier on-call habits. Right now, receivers (e.g. PagerDuty) need to support this type of analytics to do it.

I am also on the fence about whether adding this complexity into Alertmanager can be justified easily. Moreover, if we were to have tracing added to AM, it would be up to users to gather analytics from it. It doesn't feel so natural to bring this complexity into general-purpose tooling when there are already some ways (I might be wrong tho 😅) to achieve something similar. Some of the technical problems from the proposals (were they to be implemented) are already solved by tracing frameworks out of the box (e.g. batching, sampling, OTLP http/grpc, vendor-agnostic connectors/collectors).

I reckon that with tracing + proxies, some of these analytics would be possible without adding complexity and more resource usage to Alertmanager itself.

We also suffer from the issues behind "make inhibitions part of the gossip", so it would be nice to see "Feature: Instrument Alertmanager for distributed tracing" moving further. It seems like the author doesn't have much time for that.

Do we have any volunteer to push it further tho? 😅

eyazici90 avatar Nov 14 '25 12:11 eyazici90

it would be nice to see https://github.com/prometheus/alertmanager/issues/3670#issuecomment-2595014733 moving further. It seems like the author doesn't have much time for that.

I pinged them here but I think they are not available https://github.com/prometheus/alertmanager/pull/3673#issuecomment-3473752768

Do we have any volunteer to push it further tho? 😅

Tracing is on my list unless if someone picks it up before me.

siavashs avatar Nov 14 '25 13:11 siavashs

Tracing is on my list unless if someone picks it up before me.

I see no reason to wait longer, the in-flight PR is getting close to 2 years now 😅 I would say, if you have time for it, just go for it, it would be much appreciated 🙏 (I can also assist with reviews if you want to, although I am not a maintainer, only a contributor)

You may also give some credit to the former author in your PR, thus everyone is happy ;)

eyazici90 avatar Nov 14 '25 13:11 eyazici90

Assigned #3673 to myself, will open a new PR. Original author will get credits for their contribution.

siavashs avatar Nov 14 '25 13:11 siavashs