Proposal: Alert State Analytics for Alertmanager
Summary
Add alert state analytics capabilities to Alertmanager to track state transitions of alerts over time. This will provide visibility into how alerts move between unprocessed, active, and suppressed states, including tracking which alerts are inhibited and by which other alerts.
Table of Contents
GH issues don't support ToC 😔
Motivation
The primary motivation for alert state analytics comes from the need to validate proposed enhancements to Alertmanager's clustering behavior, specifically #4315 which proposes making inhibitions part of the gossip protocol.
Currently, when investigating issues like:
- Failed inhibitions during instance restarts (#4064)
- Ready endpoint reporting ready before gossip settles (#3026)
- Duplicate alert notifications in clustered deployments
...we lack the data to:
- Quantify the impact - How often do inhibition failures occur in production?
- Validate solutions - Would making inhibitions part of gossip actually solve the problem?
- Measure improvements - Can we prove that a change reduced the frequency of issues?
- Debug production issues - What state transitions led to unexpected behavior?
Without analytics, we're making architectural decisions based on theory rather than data.
Real-World Impact
At Cloudflare, we use inhibitions heavily in our alerting infrastructure. We've had numerous cases of users reporting that they were alerted when the alert should have been inhibited. Without analytical data, it's extremely difficult to:
- Determine if this was actually an inhibition failure or a misconfiguration
- Identify patterns in when inhibition failures occur
- Correlate failures with specific cluster events or topology changes
- Provide evidence-based answers to users about what happened
Having this analytical data would allow us to accurately debug whether inhibition failures are occurring and why, distinguishing between misconfigurations and actual bugs in the system.
Goals
- Track all alert state transitions, including:
  - `unprocessed` → `active`
  - `active` → `suppressed` (by silence or inhibition)
  - `suppressed` → `active`
  - Any state → `resolved` (when the alert's `EndsAt` timestamp is in the past)
  - `resolved` → `deleted` (when the garbage collector removes it and `marker.Delete()` is called)
  - State changes during cluster topology changes
- Capture suppression relationships:
  - Which alerts were suppressed (by silence or inhibition)
  - What caused the suppression (silence ID or inhibiting alert fingerprint)
  - When the suppression was established and released
- Provide an interface to expose the data:
  - For database integration: REST API endpoints for querying state history
  - For event-based systems: Publish events to external message bus/queue (e.g., Kafka, Redis)
  - Enable retrieval of state history for specific alerts and time ranges
- Minimize performance impact:
  - Asynchronous writes to not block alert processing
  - Efficient storage to handle high-cardinality alert environments
  - Optional feature (can be disabled if not needed)
Non-Goals
- Real-time alerting or dashboarding (analytics is for post-hoc analysis)
- Long-term storage (retention should be configurable and limited)
- Complex query DSL (simple API endpoints are sufficient)
- Replication across cluster members - each instance operates independently; consumers are responsible for merging/aggregating data from multiple instances if needed
Proposed Solutions
Option 1: Direct Database Integration with State-Aware Marker
Architecture:
- Wrap the existing `MemMarker` with a `StateAwareMarker` that records state changes to a database
- Use an embedded analytical database (e.g., DuckDB or SQLite)
- Employ high-performance bulk insert APIs for minimal overhead
- Add REST API endpoints to query the analytics data
Key Components:
1. State-Aware Marker
// StateAppender records alert state changes to storage
type StateAppender interface {
Append(fingerprint model.Fingerprint, state AlertState)
AppendSuppressed(fingerprint model.Fingerprint, state AlertState, suppressedBy []string)
Flush() error
Close() error
}
// StateAwareMarker decorates the existing marker with state tracking
type StateAwareMarker interface {
AlertMarker
GroupMarker
Flush() error
}
- Decorates the existing `MemMarker` implementation
- Intercepts calls to `SetActiveOrSilenced()`, `SetInhibited()`, and `Delete()`
- Appends state changes to the database asynchronously via `StateAppender`
- Maintains backward compatibility with existing code
- Tracks both when alerts become resolved (EndsAt in past) and when they are deleted from the marker (GC cleanup)
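A minimal sketch of the decorator idea, assuming simplified placeholder signatures rather than Alertmanager's actual `types.AlertMarker` interface; only the `SetInhibited()` and `Delete()` paths are shown:

```go
package analytics

import "github.com/prometheus/common/model"

// AlertState names the states tracked by this proposal.
type AlertState string

const (
	StateActive     AlertState = "active"
	StateSuppressed AlertState = "suppressed"
	StateDeleted    AlertState = "deleted"
)

// innerMarker is a simplified stand-in for the existing marker; the real
// methods take more arguments.
type innerMarker interface {
	SetInhibited(fp model.Fingerprint, inhibitedBy ...string)
	Delete(fp model.Fingerprint)
}

// StateAppender is the subset of the interface above used by this sketch.
type StateAppender interface {
	Append(fp model.Fingerprint, state AlertState)
	AppendSuppressed(fp model.Fingerprint, state AlertState, suppressedBy []string)
}

// stateAwareMarker decorates an existing marker and records every transition.
type stateAwareMarker struct {
	inner    innerMarker
	appender StateAppender
}

// SetInhibited delegates to the wrapped marker, then records the transition:
// a non-empty inhibitor list means "suppressed", an empty one means "active".
func (m *stateAwareMarker) SetInhibited(fp model.Fingerprint, inhibitedBy ...string) {
	m.inner.SetInhibited(fp, inhibitedBy...)
	if len(inhibitedBy) > 0 {
		m.appender.AppendSuppressed(fp, StateSuppressed, inhibitedBy)
		return
	}
	m.appender.Append(fp, StateActive)
}

// Delete delegates to the wrapped marker and records the deletion performed
// by the garbage collector.
func (m *stateAwareMarker) Delete(fp model.Fingerprint) {
	m.inner.Delete(fp)
	m.appender.Append(fp, StateDeleted)
}
```

Because the appender is expected to buffer and flush asynchronously (see `Flush()` above), the extra calls stay off the hot path.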
2. Analytics Subscriber
// Writer defines the interface for writing alert data to storage
type Writer interface {
InsertAlert(ctx context.Context, alert *Alert) error
}
// Subscriber subscribes to alert updates and persists them
type Subscriber interface {
Run(ctx context.Context)
}
- Subscribes to the alert provider's alert stream
- Writes alert metadata (labels, annotations) to the database
- Runs in a separate goroutine to avoid blocking alert processing
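A rough sketch of the subscriber loop, assuming the alert stream is exposed as a channel (a stand-in for the provider's real subscription API) and re-stating the `Alert`/`Writer` shapes so the example is self-contained:

```go
package analytics

import (
	"context"
	"log"

	"github.com/prometheus/common/model"
)

// Alert is a minimal placeholder for the metadata persisted by the subscriber.
type Alert struct {
	Fingerprint model.Fingerprint
	Labels      model.LabelSet
	Annotations model.LabelSet
}

// Writer persists alert metadata (mirrors the interface above).
type Writer interface {
	InsertAlert(ctx context.Context, alert *Alert) error
}

// subscriber consumes an alert stream and writes metadata asynchronously.
// The channel stands in for the provider's subscription mechanism.
type subscriber struct {
	updates <-chan *Alert
	writer  Writer
}

// Run loops until the context is cancelled, persisting each alert it sees.
// Running this in its own goroutine keeps alert processing unblocked.
func (s *subscriber) Run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case a, ok := <-s.updates:
			if !ok {
				return
			}
			if err := s.writer.InsertAlert(ctx, a); err != nil {
				log.Printf("analytics: insert alert %s: %v", a.Fingerprint, err)
			}
		}
	}
}
```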
3. Database Storage
Schema:
-- Alerts table
CREATE TABLE alerts (
id UUID PRIMARY KEY,
fingerprint VARCHAR NOT NULL UNIQUE,
alertname VARCHAR NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- State changes table
CREATE TABLE alert_states (
id UUID PRIMARY KEY,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
alert_fingerprint VARCHAR NOT NULL,
state VARCHAR NOT NULL, -- 'unprocessed', 'active', 'suppressed', 'resolved', 'deleted'
suppressed_by VARCHAR, -- Fingerprint of inhibiting alert or silence ID (only for suppressed state)
suppressed_reason VARCHAR, -- 'silence' or 'inhibition' (only for suppressed state)
FOREIGN KEY (alert_fingerprint) REFERENCES alerts(fingerprint)
);
-- Labels and annotations (normalized)
CREATE TABLE labels (...);
CREATE TABLE annotations (...);
- Uses deterministic UUIDs (UUIDv5) to avoid duplicate inserts
- Maintains in-memory maps to skip already-seen fingerprints
- Transactions ensure consistency
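As an illustration of the deterministic-ID idea, here is a sketch using `github.com/google/uuid` (an assumed dependency), whose `NewSHA1` constructor produces version-5 UUIDs; the namespace choice and the in-memory seen-set are illustrative:

```go
package analytics

import (
	"sync"

	"github.com/google/uuid"
	"github.com/prometheus/common/model"
)

// alertRowID derives a deterministic UUIDv5 from the fingerprint, so
// re-inserting the same alert produces the same primary key instead of a
// duplicate row. The namespace here is arbitrary; any stable UUID works.
func alertRowID(fp model.Fingerprint) uuid.UUID {
	// uuid.NewSHA1 with a namespace implements UUID version 5.
	return uuid.NewSHA1(uuid.NameSpaceOID, []byte(fp.String()))
}

// seenAlerts skips fingerprints that were already written by this process.
type seenAlerts struct {
	mu   sync.Mutex
	seen map[model.Fingerprint]struct{}
}

// MarkNew returns true the first time a fingerprint is observed.
func (s *seenAlerts) MarkNew(fp model.Fingerprint) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen == nil {
		s.seen = make(map[model.Fingerprint]struct{})
	}
	if _, ok := s.seen[fp]; ok {
		return false
	}
	s.seen[fp] = struct{}{}
	return true
}
```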
4. Storage Interface
// Database defines the interface for analytics storage
type Database interface {
Reader
Writer
}
type Reader interface {
GetAlertStatesByFingerprint(ctx context.Context, fingerprint model.Fingerprint) ([]*AlertState, error)
GetAllAlertsAndTheirStates(ctx context.Context) ([]*Alert, error)
}
5. REST API Endpoints
New endpoints:
- `GET /api/v2/alerts/states` - Get all alerts with their recent states
- `GET /api/v2/alerts/{fingerprint}/states` - Get state history for a specific alert
Advantages:
- Minimal code changes to core alert processing logic
- High performance (bulk insert APIs can handle millions of rows/sec)
- Embedded database (no external dependencies)
- SQL queries for flexible analysis
- Relatively straightforward implementation
Disadvantages:
- Tight coupling between marker and database
- Requires embedded database dependency
- Database file management (rotation, cleanup)
- Potential for write amplification in high-cardinality environments
Performance Considerations:
- Bulk insert APIs provide extremely fast writes
- In-memory maps reduce duplicate writes by ~90%
- Async writes don't block alert processing
- Configurable retention (default: 7 days recommended)
Option 2: Event-Based Architecture
Architecture:
- Introduce an event system for alert lifecycle events
- Emit events for state changes without modifying the marker
- Publish events to external message bus/queue systems (e.g., Kafka, Redis, RabbitMQ)
- No built-in storage or REST API - consumers handle data persistence and querying
Key Components:
1. Event System
type AlertEventMetadata struct {
Alertname string
Labels model.LabelSet
Annotations model.LabelSet
SuppressedBy []string // Silence IDs or inhibiting alert fingerprints
SuppressedReason string // 'silence' or 'inhibition'
}
type AlertEvent struct {
Timestamp time.Time
Fingerprint model.Fingerprint
EventType EventType // StateChanged, Suppressed, Unsuppressed, Resolved, Deleted
OldState AlertState
NewState AlertState
Metadata AlertEventMetadata
}
type EventHandler interface {
HandleEvent(ctx context.Context, event AlertEvent) error
}
type EventBus interface {
Subscribe(handler EventHandler)
Publish(ctx context.Context, event AlertEvent) error
}
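A minimal in-memory implementation of the `EventBus` above could look like the following sketch: each subscriber gets a buffered channel drained by its own goroutine, so a slow handler cannot block alert processing (at the cost of dropping events when the buffer fills):

```go
package analytics

import (
	"context"
	"log"
)

// memBus is a minimal in-memory EventBus. Subscribe is expected to be called
// during setup, before events start flowing (it is not goroutine-safe).
type memBus struct {
	subs []chan AlertEvent
}

func NewMemBus() *memBus { return &memBus{} }

// Subscribe registers a handler and starts a goroutine that drains its queue.
func (b *memBus) Subscribe(handler EventHandler) {
	ch := make(chan AlertEvent, 1024)
	b.subs = append(b.subs, ch)
	go func() {
		for ev := range ch {
			if err := handler.HandleEvent(context.Background(), ev); err != nil {
				log.Printf("analytics: handler error for %s: %v", ev.Fingerprint, err)
			}
		}
	}()
}

// Publish fans the event out to all subscribers; if a subscriber's buffer is
// full the event is dropped rather than blocking the caller.
func (b *memBus) Publish(_ context.Context, event AlertEvent) error {
	for _, ch := range b.subs {
		select {
		case ch <- event:
		default: // buffer full, drop instead of blocking alert processing
		}
	}
	return nil
}
```

A production version would add a dropped-events metric and graceful shutdown.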
2. Event Emission Points
- Modify `MemMarker.SetActiveOrSilenced()` to emit events (for active/silenced transitions)
- Modify `MemMarker.SetInhibited()` to emit events (for inhibition transitions)
- Modify `MemMarker.Delete()` to emit events (for alert deletion)
- Hook into alert resolution detection (when the `EndsAt` timestamp passes)
- Emit events with full context including suppression details in metadata
- Events include timestamps for ordering; consumers can use timestamp or UUIDv7 to handle out-of-order delivery
3. Event Publisher
type EventPublisher interface {
EventHandler
}
// Implementation would publish events to external message bus (Kafka, Redis, etc.)
// Examples: KafkaPublisher, RedisPublisher, RabbitMQPublisher
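One way to keep the bus-specific client out of the core code is to hide it behind a small transport function. The sketch below (using the `AlertEvent` and `EventHandler` types above, with an assumed `SendFunc` signature) serializes events to JSON and keys them by fingerprint so partitioned buses preserve per-alert ordering:

```go
package analytics

import (
	"context"
	"encoding/json"
)

// SendFunc abstracts the actual message-bus client (Kafka, Redis, RabbitMQ);
// it receives a partition/routing key and the serialized event payload.
type SendFunc func(ctx context.Context, key string, payload []byte) error

// jsonPublisher implements EventHandler by serializing events to JSON and
// handing them to the transport.
type jsonPublisher struct {
	send SendFunc
}

func NewJSONPublisher(send SendFunc) *jsonPublisher {
	return &jsonPublisher{send: send}
}

// HandleEvent marshals the event and forwards it, keyed by fingerprint so
// all events for one alert land on the same partition.
func (p *jsonPublisher) HandleEvent(ctx context.Context, event AlertEvent) error {
	payload, err := json.Marshal(event)
	if err != nil {
		return err
	}
	return p.send(ctx, event.Fingerprint.String(), payload)
}
```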
Advantages:
- Loose coupling - analytics doesn't affect core logic
- Extensible - easy to add new event handlers
- Could be used for other features (webhooks, audit logs)
- Easier to disable or configure
- Offloads storage and querying to external systems
- Can integrate with existing event processing infrastructure
Disadvantages:
- More invasive changes to `MemMarker`
- Event bus adds complexity
- Potential for event loss if handlers are slow
- Need to implement event buffering/retries
- Requires external infrastructure (message bus)
- No built-in querying capability - consumers must implement their own storage/queries
- More operational overhead
Configuration
Option 1: Database Integration Configuration
# alertmanager.yml
analytics:
enabled: true
type: database
storage:
path: /data/analytics.db
retention: 168h # 7 days
# Optional: limit database size
max_size_mb: 1024
# Optional: sample rate (1.0 = 100%, 0.1 = 10%)
sample_rate: 1.0
Command-line flags:
--analytics.enabled
--analytics.type=database
--analytics.storage.path=/data/analytics.db
--analytics.retention=168h
Option 2: Event Publisher Configuration
# alertmanager.yml
analytics:
enabled: true
type: event_publisher
publisher:
type: kafka # or redis, rabbitmq
brokers:
- kafka1:9092
- kafka2:9092
topic: alertmanager-state-events
# Optional: sample rate
sample_rate: 1.0
Command-line flags:
--analytics.enabled
--analytics.type=event_publisher
--analytics.publisher.type=kafka
--analytics.publisher.brokers=kafka1:9092,kafka2:9092
--analytics.publisher.topic=alertmanager-state-events
API Examples (Option 1 Only)
Get all alerts with recent state changes
GET /api/v2/alerts/states
Response:
[
{
"fingerprint": "abc123",
"alertname": "HighCPU",
"labels": {...},
"annotations": {...},
"states": [
{
"id": "uuid",
"timestamp": "2025-11-13T14:30:00Z",
"state": "active"
},
{
"id": "uuid",
"timestamp": "2025-11-13T14:35:00Z",
"state": "suppressed",
"suppressed_by": "def456",
"suppressed_reason": "inhibited"
}
]
}
]
Get state history for a specific alert
GET /api/v2/alerts/{fingerprint}/states
Response:
{
"fingerprint": "abc123",
"states": [
{
"id": "uuid",
"timestamp": "2025-11-13T14:00:00Z",
"state": "active"
},
{
"id": "uuid",
"timestamp": "2025-11-13T14:30:00Z",
"state": "suppressed",
"suppressed_by": "def456",
"suppressed_reason": "inhibited"
},
{
"id": "uuid",
"timestamp": "2025-11-13T14:45:00Z",
"state": "active"
}
]
}
Event Examples (Option 2 Only)
Alert State Change Event
{
"timestamp": "2025-11-13T14:30:00Z",
"fingerprint": "abc123",
"event_type": "state_changed",
"old_state": "active",
"new_state": "suppressed",
"metadata": {
"alertname": "HighCPU",
"labels": {...},
"suppressed_by": "def456",
"suppressed_reason": "inhibition"
}
}
Alert Deletion Event
{
"timestamp": "2025-11-13T15:00:00Z",
"fingerprint": "abc123",
"event_type": "deleted",
"old_state": "resolved",
"new_state": "deleted",
"metadata": {
"alertname": "HighCPU",
"labels": {...}
}
}
Open Questions
- Retention (Option 1 only): What's the right default retention period?
  - Proposal: 7 days (168 hours)
  - Rationale: Sufficient for post-mortem analysis, limited disk usage
  - Configurable for different use cases
- Schema Evolution: How do we handle schema changes?
  - Option 1: Version the schema in the database, provide a migration path
  - Option 2: Version events, consumers handle different event versions
  - Consider forward/backward compatibility in both cases
References
A third option could be something similar to nflog or making nflog more generic so it can be used for things other than notifications as well.
This means that we wouldn't just depend on the data for analytics: replication would also allow peers to see all decisions made by others.
For example a new peer joining the cluster can immediately have a view of all active/inhibited/silenced alerts.
This information can then be used by a peer to bypass heavy calculations if the peer ahead of it has just made a decision to suppress an alert (think of it as optionally replaying the log).
APIs can be implemented on top to query this "log".
Related discussion on extending nflog https://github.com/prometheus/alertmanager/pull/4682#issuecomment-3493136278
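To make the idea concrete, a hypothetical shape for such a generic, replicated decision log might look like the sketch below; this is not nflog's actual API, just an illustration of the entry and replay semantics described above:

```go
package analytics

import (
	"time"

	"github.com/prometheus/common/model"
)

// decisionEntry is a hypothetical, generic log entry describing a state
// decision (suppress, resolve, delete) made by some peer. An nflog-style
// component would gossip and garbage-collect entries like this.
type decisionEntry struct {
	Peer         string
	Fingerprint  model.Fingerprint
	State        string   // e.g. "suppressed"
	SuppressedBy []string // silence IDs or inhibiting fingerprints
	Timestamp    time.Time
	ExpiresAt    time.Time
}

// replay lets a newly joined peer rebuild its view of current decisions
// from the replicated log instead of recomputing them from scratch.
func replay(entries []decisionEntry, now time.Time) map[model.Fingerprint]decisionEntry {
	latest := make(map[model.Fingerprint]decisionEntry, len(entries))
	for _, e := range entries {
		if now.After(e.ExpiresAt) {
			continue // expired decisions are ignored
		}
		if cur, ok := latest[e.Fingerprint]; !ok || e.Timestamp.After(cur.Timestamp) {
			latest[e.Fingerprint] = e
		}
	}
	return latest
}
```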
An addition for the use case:
Having a proper way to analyze alerts enables teams to dig into their alerting volume, analyze it, and take action (read: prioritization) towards healthier on-call habits. Right now, receivers (e.g. PagerDuty) need to support this type of analytics to make that possible.
An addition for Option 2: Event-Based Architecture
It may not need to support an (external) event/queuing system. Maybe writing an append-only file to disk, like Redis does (AOF - https://redis.io/docs/latest/operate/oss_and_stack/management/persistence/), is enough. This way, no external libraries need to be integrated and the operator can take care of reading and shipping the file elsewhere.
Alternatively, supporting tooling to read and consume this file could be built as its own tool (next to amtool).
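A minimal sketch of this AOF-style variant, reusing Option 2's `AlertEvent` type and writing one JSON document per line; file rotation, fsync policy, and the path handling are left out / assumed:

```go
package analytics

import (
	"context"
	"encoding/json"
	"os"
)

// aofPublisher appends each event as a single JSON line to a local file,
// similar in spirit to Redis' AOF. An operator (or a future amtool-like
// companion) can tail and ship the file elsewhere.
type aofPublisher struct {
	file *os.File
	enc  *json.Encoder
}

// NewAOFPublisher opens (or creates) the file in append-only mode.
func NewAOFPublisher(path string) (*aofPublisher, error) {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, err
	}
	return &aofPublisher{file: f, enc: json.NewEncoder(f)}, nil
}

// HandleEvent writes one JSON document per line ("JSON lines" format);
// json.Encoder appends the trailing newline for us.
func (p *aofPublisher) HandleEvent(_ context.Context, event AlertEvent) error {
	return p.enc.Encode(event)
}

func (p *aofPublisher) Close() error { return p.file.Close() }
```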
You could also resume work on https://github.com/prometheus/alertmanager/pull/3673 as a third option. The PR mentions that tracing is already available in Prometheus (although I don't know how extensive that is), but I think OTEL tracing would fit much better into the ecosystem.
Tracing has the advantage of sampling, because full analytics will be expensive as it will happen per alert.
The sample rate can be adjusted, and even a few samples can still be good enough to capture issues.
I also like the idea of incorporating it as OTel telemetry; perhaps we emit the state changes as log events. That way the need for an event bus disappears, which reduces implementation complexity, and the querying capabilities can be delivered via the telemetry backend.
Ideally, for Alertmanager we could just set the OTLP endpoint to send data to. In the case where a user wants to push the events to a messaging system such as Kafka, the OTel Collector can be used with the corresponding messaging-system exporter.
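For illustration, a state transition could be recorded with the standard OpenTelemetry Go API roughly like the sketch below (shown here as a span event; the attribute keys are made up for the example, and a log-signal variant would be analogous):

```go
package analytics

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// recordStateChange emits an alert state transition as OTel telemetry.
// With the SDK's OTLP exporter configured, the data flows to whatever
// backend (or OTel Collector pipeline) the operator points it at.
func recordStateChange(ctx context.Context, fingerprint, oldState, newState, suppressedBy string) {
	tracer := otel.Tracer("alertmanager/analytics")
	_, span := tracer.Start(ctx, "alert.state_change")
	defer span.End()

	// Attribute keys here are illustrative, not an agreed convention.
	span.AddEvent("state_changed", trace.WithAttributes(
		attribute.String("alert.fingerprint", fingerprint),
		attribute.String("alert.state.old", oldState),
		attribute.String("alert.state.new", newState),
		attribute.String("alert.suppressed_by", suppressedBy),
	))
}
```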
Tracing has the advantage of sampling, because full analytics will be expensive as it will happen per alert. Sample rate can be adjusted and even a few samples can still be good enough to capture issues.
I would argue that sampling alone is not a solid argument to not pick option 1 or 2, as the concept of sampling can be added there as well 🤔.
I also like the idea of incorporating it as otel telemetry, perhaps we emit the state changes as log events.
This sounds interesting indeed. I personally don't have the knowledge needed to decide if this is solid or not. I would need to spend some time digging into OTEL to see if it is "as simple" as this.
but I think OTEL tracing would fit much better into the ecosystem.
Tracing alone I would say might not be it. I would argue that with tracing alone, it would be impossible or too hard to answer analytics-based questions like:
Having a proper way to analyze alerts enables teams to dig into their alerting volume, analyze it, and take action (read: prioritization) towards healthier on-call habits. Right now, receivers (e.g. PagerDuty) need to support this type of analytics to make that possible.
Perhaps @ArthurSens can be of assistance on the OTel interoperability topic as well as the OTel Collector. Note the Collector could even transform the events into metrics visible in Prometheus.
For integrations like PagerDuty we could also use an external proxy which sits between Alertmanager and the "internet"; it will "log" or "trace" all notifications and can enrich data if required. This can then emit any form of observability data desired.
Integration would be easy through a "global proxy" configuration in Alertmanager.
We could also have a similar component in front of Alertmanager which analyses all the alerts.
So basically we analyse input and output and emit the merged data.
graph TD;
Prometheus-->Input_Analyser-->Alertmanager-->Output_Analyser-->Integration(PagerDuty);
Input_Analyser-->Data_Merger/Enricher;
Output_Analyser-->Data_Merger/Enricher;
This has 2 benefits:
- zero performance cost for Alertmanager
- optional component(s) for analytics
I like your thinking @siavashs, and in fact I have proposed some small extensions to OTLP/OpenTelemetry to help facilitate this use case and make it a first-class experience: https://github.com/open-telemetry/opentelemetry-specification/issues/4729
The process diagram is almost as I imagined it when I was writing that proposal. In my case the proxy would be the OTel Collector.
For integrations like PagerDuty we could also use an external proxy which sits between Alertmanager and "internet", it will "log" or "trace" all notifications and can enrich data if required. This can then emit any form of observability data desired.
Integration would be easy through a "global proxy" configuration in Alertmanager.
We tend to do this as well. If the wire protocols (e.g. HTTP contracts) are the same, it is not hard to achieve, I would say.
Having a proper way to analyze alerts enables teams to dig into their alerting volume, analyze it, and take action (read: prioritization) towards healthier on-call habits. Right now, receivers (e.g. PagerDuty) need to support this type of analytics to make that possible.
I am also on the fence about whether adding this complexity into Alertmanager can be easily justified. Moreover, if we were to have tracing added to AM, it would be up to users to gather analytics from it. It doesn't feel so natural to bring this complexity into general-purpose tooling when there are already some ways (I might be wrong tho 😅) to achieve something similar. Some of the technical problems from the proposals (that would otherwise have to be implemented) are already solved out-of-the-box by tracing frameworks (e.g. batching, sampling, OTLP over HTTP/gRPC, vendor-agnostic connectors/collectors).
I reckon that with tracing + proxies, some of these analytics would be possible without adding complexity and more resource usage to Alertmanager itself.
We also suffer from the issues behind "make inhibitions part of the gossip"; it would be nice to see "Feature: Instrument Alertmanager for distributed tracing" moving forward. It seems like the author doesn't have much time for that.
Do we have any volunteer to push it further tho? 😅
it would be nice to see https://github.com/prometheus/alertmanager/issues/3670#issuecomment-2595014733 moving further. It seems like the author doesn't have much time for that.
I pinged them here but I think they are not available https://github.com/prometheus/alertmanager/pull/3673#issuecomment-3473752768
Do we have any volunteer to push it further tho? 😅
Tracing is on my list unless someone picks it up before me.
Tracing is on my list unless someone picks it up before me.
I see no reason to wait longer; the in-flight PR is getting close to 2 years old now 😅. I would say, if you've got time for it, just go for it, it would be much appreciated 🙏 (_I can also assist with reviews if you want to, although I am not a maintainer, only a contributor_)
You may also give some credit to the former author in your PR, so everyone is happy ;)
Assigned #3673 to myself, will open a new PR. The original author will get credit for their contribution.