Add `events.failure_store` metric to track events sent to Elasticsearch failure store
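For context on what the new counter tracks, here is a minimal sketch of how a per-item `failure_store` counter could be wired up, assuming the `elastic-agent-libs` monitoring package and a hypothetical, trimmed-down bulk-response item type; the actual types, field names, and call sites in this PR may differ:

```go
package example

import "github.com/elastic/elastic-agent-libs/monitoring"

// bulkResultItem is a hypothetical, simplified view of a single item in an
// Elasticsearch _bulk response; the real client type in libbeat differs.
type bulkResultItem struct {
	Status       int    `json:"status"`
	FailureStore string `json:"failure_store"` // "used" when the doc landed in the failure store
}

// outputMetrics groups counters exposed under libbeat.output.events.
type outputMetrics struct {
	acked        *monitoring.Int
	failureStore *monitoring.Int
}

func newOutputMetrics(reg *monitoring.Registry) *outputMetrics {
	return &outputMetrics{
		acked:        monitoring.NewInt(reg, "events.acked"),
		failureStore: monitoring.NewInt(reg, "events.failure_store"),
	}
}

// onBulkItem updates the counters for one bulk-response item.
func (m *outputMetrics) onBulkItem(it bulkResultItem) {
	if it.Status >= 200 && it.Status < 300 {
		m.acked.Inc()
		// Count events that were accepted but redirected to the failure store.
		if it.FailureStore == "used" {
			m.failureStore.Inc()
		}
	}
}
```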
## Proposed commit message
See title
## Checklist
- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [ ] ~~I have made corresponding change to the default configuration files~~
- [x] I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the `stresstest.sh` script to run them under stress conditions and the race detector to verify their stability.
- [x] I have added an entry in `./changelog/fragments` using the changelog tool.
~~## Disruptive User Impact~~ ~~## Author's Checklist~~
## How to test this PR locally

### Manual Testing Procedure: Failure Store Metric

#### Prerequisites
- Elasticsearch cluster (version 8.11.0+) with failure store support enabled
- A Beat instance (Filebeat, Metricbeat, etc.) configured to output to Elasticsearch
- Access to Elasticsearch API and Beat monitoring/metrics endpoint
#### Test Setup
1. Create a Data Stream with Failure Store Enabled
Create an index template with failure store enabled and strict mappings:
PUT _index_template/test-failure-store-template
{
"index_patterns": ["test-failure-store-*"],
"data_stream": {},
"template": {
"data_stream_options": {
"failure_store": {
"enabled": true
}
},
"mappings": {
"properties": {
"method":{
"type": "integer"
}
}
}
}
}
2. Initialize the Data Stream
Create the data stream by indexing two documents:
POST test-failure-store-ds/_bulk
{"create":{}}
{"@timestamp":"2025-12-12T15:42:00Z", "foo":"bar"}
{"create":{}}
{"@timestamp":"2025-12-12T15:42:00Z", "method": "POST"}
Ensure one of the documents went to the failure store (look for `"failure_store": "used"` in the corresponding item); the response should look like this:
{
"errors": false,
"took": 200,
"items": [
{
"create": {
"_index": ".ds-test-failure-store-ds-2025.12.12-000001",
"_id": "SIdcFJsBIoqQtrd2QGHk",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 2,
"_primary_term": 1,
"status": 201
}
},
{
"create": {
"_index": ".fs-test-failure-store-ds-2025.12.12-000002",
"_id": "SodcFJsBIoqQtrd2QGH3",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 268,
"_primary_term": 1,
"failure_store": "used",
"status": 201
}
}
]
}
Ensure there is one document in the failure store:
GET test-failure-store-ds::failures/_search
3. Generate some logs that will cause a mapping conflict
You can use Docker and flog for this:
docker run -it --rm mingrammer/flog -f json -d 1 -s 1 -l > /tmp/flog.ndjson
4. Run Filebeat
Build Filebeat from this PR and run it using the following configuration (adjust the output settings as necessary):
filebeat.yml
filebeat.inputs:
- type: filestream
id: a-very-unique-id
enabled: true
paths:
- /tmp/flog.ndjson
parsers:
- ndjson:
keys_under_root: true
index: test-failure-store-ds
file_identity.native: ~
prospector.scanner:
fingerprint.enabled: false
queue.mem:
flush.timeout: 0
output.elasticsearch:
hosts:
- "https://localhost:9200"
username: elastic
password: changeme
ssl.verification_mode: none
logging:
metrics.period: 5s
to_stderr: true
You can run Filebeat and pipe its output through jq to parse the logs:
go run . --path.home=$PWD 2>&1 | jq '{"ts": ."@timestamp", "lvl": ."log.level", "logger": ."log.logger", "m": .message, "fs": .monitoring.metrics.libbeat.output.events.failure_store}' -c
You should see some 5s metrics like this:
{"ts":"2025-12-12T16:04:40.795-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs"
:10}
{"ts":"2025-12-12T16:04:45.794-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs"
:4}
{"ts":"2025-12-12T16:04:50.794-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs"
:6}
where `fs` is the counter of events sent to the failure store.
The metrics are also published on the stats endpoint (note that this requires the HTTP monitoring endpoint to be enabled, for example with `http.enabled: true` in the configuration):
curl http://localhost:5066/stats | jq '.libbeat.output.events'
will output something like this:
{
"acked": 105,
"active": 0,
"batches": 21,
"dead_letter": 0,
"dropped": 0,
"duplicates": 0,
"failed": 0,
"failure_store": 105,
"toomany": 0,
"total": 105
}
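If you want to check the counter programmatically (for example, polling until the expected number of events shows up), a small Go helper along these lines reads the same endpoint; it assumes the stats address `localhost:5066` used in the curl command above:

```go
// fetchstats: reads the Beat stats endpoint and prints the failure_store
// counter; equivalent to the curl | jq one-liner above.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://localhost:5066/stats")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Decode only the part of the payload we care about.
	var stats struct {
		Libbeat struct {
			Output struct {
				Events map[string]int64 `json:"events"`
			} `json:"output"`
		} `json:"libbeat"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("failure_store=%d total=%d\n",
		stats.Libbeat.Output.Events["failure_store"],
		stats.Libbeat.Output.Events["total"])
}
```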
## Related issues
- Closes https://github.com/elastic/beats/issues/47164
~~## Use cases~~ ~~## Screenshots~~ ~~## Logs~~
:robot: GitHub comments
Just comment with:
- `run docs-build`: Re-trigger the docs validation. (use unformatted text in the comment!)
This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @belimawr? 🙏. For such, you'll need to label your PR with:
- The upcoming major version of the Elastic Stack
- The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)
To fixup this pull request, you need to add the backport labels for the needed branches, such as:
- `backport-8./d` is the label to automatically backport to the `8./d` branch. `/d` is the digit
- `backport-active-all` is the label that automatically backports to all active branches.
- `backport-active-8` is the label that automatically backports to all active minor branches for the 8 major.
- `backport-active-9` is the label that automatically backports to all active minor branches for the 9 major.
🔍 Preview links for changed docs
- docs/reference/auditbeat/understand-auditbeat-logs.md
- docs/reference/filebeat/understand-filebeat-logs.md
- docs/reference/heartbeat/understand-heartbeat-logs.md
- docs/reference/metricbeat/understand-metricbeat-logs.md
- docs/reference/packetbeat/understand-packetbeat-logs.md
- docs/reference/winlogbeat/understand-winlogbeat-logs.md
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
https://github.com/elastic/beats/pull/48075 should fix the failing test.