
Add `events.failure_store` metric to track events sent to Elasticsearch failure store

Open · belimawr opened this pull request 2 weeks ago • 4 comments

Proposed commit message

See title

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [ ] ~~I have made corresponding change to the default configuration files~~
  • [x] I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • [x] I have added an entry in ./changelog/fragments using the changelog tool.

~~## Disruptive User Impact~~ ~~## Author's Checklist~~

How to test this PR locally

Manual Testing Procedure: Failure Store Metric

Prerequisites

  1. Elasticsearch cluster (version 8.11.0+) with failure store support enabled
  2. A Beat instance (Filebeat, Metricbeat, etc.) configured to output to Elasticsearch
  3. Access to Elasticsearch API and Beat monitoring/metrics endpoint

Test Setup

1. Create a Data Stream with Failure Store Enabled

Create an index template with failure store enabled and strict mappings:

PUT _index_template/test-failure-store-template
{
  "index_patterns": ["test-failure-store-*"],
  "data_stream": {},
  "template": {
    "data_stream_options": {
      "failure_store": {
        "enabled": true
      }
    },
    "mappings": {
      "properties": {
        "method": {
          "type": "integer"
        }
      }
    }
  }
}

2. Initialize the Data Stream

Create the data stream by indexing two documents:

POST test-failure-store-ds/_bulk
{"create":{}}
{"@timestamp":"2025-12-12T15:42:00Z", "foo":"bar"}
{"create":{}}
{"@timestamp":"2025-12-12T15:42:00Z", "method": "POST"}

Ensure one of the documents went to the failure store (look for `"failure_store": "used"`); the response should look like this:

{
  "errors": false,
  "took": 200,
  "items": [
    {
      "create": {
        "_index": ".ds-test-failure-store-ds-2025.12.12-000001",
        "_id": "SIdcFJsBIoqQtrd2QGHk",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 2,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "create": {
        "_index": ".fs-test-failure-store-ds-2025.12.12-000002",
        "_id": "SodcFJsBIoqQtrd2QGH3",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 268,
        "_primary_term": 1,
        "failure_store": "used",
        "status": 201
      }
    }
  ]
}
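If you want to confirm programmatically which bulk items landed in the failure store, a minimal Python sketch of the check (the sample payload below is an abridged copy of the response shown above, not live output):

```python
import json

# Abridged bulk response shaped like the one shown above.
bulk_response = json.loads("""
{
  "errors": false,
  "items": [
    {"create": {"_index": ".ds-test-failure-store-ds-2025.12.12-000001", "status": 201}},
    {"create": {"_index": ".fs-test-failure-store-ds-2025.12.12-000002",
                "failure_store": "used", "status": 201}}
  ]
}
""")

# Count items that Elasticsearch redirected to the failure store.
failure_store_count = sum(
    1
    for item in bulk_response["items"]
    if item.get("create", {}).get("failure_store") == "used"
)
print(failure_store_count)  # 1 for the sample response above
```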

Ensure there is one document in the failure store:

GET test-failure-store-ds::failures/_search

3. Generate some logs that will cause a mapping conflict

You can use Docker and flog for this:

docker run -it --rm mingrammer/flog -f json -d 1 -s 1 -l > /tmp/flog.ndjson
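If Docker is not available, a hypothetical stand-in for flog: write a few ndjson lines yourself whose string `method` field conflicts with the template's integer mapping, so Elasticsearch routes them to the failure store. (The field values below are illustrative, not flog's real output.)

```python
import json

# Hypothetical stand-in for flog: ndjson lines with a string "method"
# field, which conflicts with the integer mapping in the index template.
lines = [
    {"@timestamp": "2025-12-12T15:42:00Z", "method": m, "message": f"request {i}"}
    for i, m in enumerate(["GET", "POST", "PUT", "DELETE"])
]

with open("/tmp/flog.ndjson", "w") as f:
    for line in lines:
        f.write(json.dumps(line) + "\n")
```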

4. Run Filebeat

Build Filebeat from this PR and run it using the following configuration (adjust the output settings as necessary):

filebeat.yml

filebeat.inputs:
  - type: filestream
    id: a-very-unique-id
    enabled: true
    paths:
      - /tmp/flog.ndjson
    parsers:
      - ndjson:
          keys_under_root: true
    index: test-failure-store-ds
    file_identity.native: ~
    prospector.scanner:
      fingerprint.enabled: false

queue.mem:
  flush.timeout: 0

output.elasticsearch:
  hosts:
    - "https://localhost:9200"
  username: elastic
  password: changeme
  ssl.verification_mode: none

logging:
  metrics.period: 5s
  to_stderr: true

You can run Filebeat using jq to parse the logs:

go run . --path.home=$PWD 2>&1 | jq '{"ts": ."@timestamp", "lvl": ."log.level", "logger": ."log.logger", "m": .message, "fs": .monitoring.metrics.libbeat.output.events.failure_store}' -c

You should see the 5-second metrics snapshots, like this:

{"ts":"2025-12-12T16:04:40.795-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs":10}
{"ts":"2025-12-12T16:04:45.794-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs":4}
{"ts":"2025-12-12T16:04:50.794-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs":6}

Here `fs` is the counter of events sent to the failure store.
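To aggregate the counter over a whole run instead of eyeballing it, a small Python sketch that sums the `fs` field across log lines shaped like the samples above (the three lines below are copied from this document):

```python
import json

# Sample monitoring log lines shaped like the output above.
log_lines = [
    '{"ts":"2025-12-12T16:04:40.795-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs":10}',
    '{"ts":"2025-12-12T16:04:45.794-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs":4}',
    '{"ts":"2025-12-12T16:04:50.794-0500","lvl":"info","logger":"monitoring","m":"Non-zero metrics in the last 5s","fs":6}',
]

# Sum the failure-store counter across reporting periods.
total_fs = sum(json.loads(line).get("fs", 0) for line in log_lines)
print(total_fs)  # 20 for the three samples above
```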

The metrics are also published in the stats endpoint:

curl http://localhost:5066/stats | jq '.libbeat.output.events'

will output something like this:

{
  "acked": 105,
  "active": 0,
  "batches": 21,
  "dead_letter": 0,
  "dropped": 0,
  "duplicates": 0,
  "failed": 0,
  "failure_store": 105,
  "toomany": 0,
  "total": 105
}
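A quick sanity check on those numbers: events redirected to the failure store are still acknowledged by Elasticsearch (they return status 201, as in the bulk response earlier), so the counter should never exceed `acked` or `total`. A sketch against the sample payload above:

```python
import json

# Sample stats payload copied from the output above.
stats = json.loads("""
{
  "acked": 105, "active": 0, "batches": 21, "dead_letter": 0,
  "dropped": 0, "duplicates": 0, "failed": 0,
  "failure_store": 105, "toomany": 0, "total": 105
}
""")

# Failure-store events are still acked, so this invariant should hold.
assert stats["failure_store"] <= stats["acked"] <= stats["total"]
print(stats["failure_store"])  # 105 for the sample above
```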

Related issues

  • Closes https://github.com/elastic/beats/issues/47164

~~## Use cases~~ ~~## Screenshots~~ ~~## Logs~~

belimawr avatar Dec 11 '25 21:12 belimawr

:robot: GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

github-actions[bot] avatar Dec 11 '25 21:12 github-actions[bot]

This pull request does not have a backport label. If this is a bug or security fix, could you label this PR @belimawr? 🙏. For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch, where \d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

mergify[bot] avatar Dec 11 '25 21:12 mergify[bot]

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

elasticmachine avatar Dec 12 '25 21:12 elasticmachine

https://github.com/elastic/beats/pull/48075 should fix the failing test.

belimawr avatar Dec 15 '25 13:12 belimawr