fluent-bit
Expose storage metrics and dropped chunks in prometheus endpoint
Is your feature request related to a problem? We've been trying to configure fluent-bit in a system with high throughput and high record cardinality. In our scenario data consistency is not critical: we prefer to sample records and fail fast, in favor of keeping a healthy pipeline with constant throughput.
We've tried several buffer configurations, but under heavy load we consistently get the same results: with the retry mechanism, fluent-bit accumulates data over time until the buffer becomes full. At that point the throughput (rps) drops as the pipeline starts dropping chunks while adding new ones.
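For illustration only, here is a minimal sketch of the kind of filesystem-buffered, limited-retry setup described above; the paths, limits, retry count, and the es output are placeholder assumptions rather than a recommendation:

```
[SERVICE]
    flush                     1
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.backlog.mem_limit 64M

[INPUT]
    name          tail
    path          /var/log/app/*.log
    storage.type  filesystem
    mem_buf_limit 128M

[OUTPUT]
    name                     es
    match                    *
    host                     127.0.0.1
    port                     9200
    retry_limit              2
    storage.total_limit_size 512M
```

Even with limits like these in place, the only visibility into how full the buffer is comes from the JSON API discussed below.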
The metrics are only exposed on a JSON endpoint, which makes them hard to collect and aggregate. Even though I could see error logs about chunks being dropped, I couldn't see any metrics for the output storage.
Describe the solution you'd like
- Related issue to disable retries
- Add storage metrics to Prometheus endpoint
- Add metrics for chunks being dropped.
Additional context
- Without storage/chunk metrics, it is hard to evaluate the health of the pipeline until the problem becomes symptomatic, i.e. when the buffer is full.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
Re-opening; related to other additional metrics that are required.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
FYI:
curl -s http://127.0.0.1:2020/api/v1/storage | jq
{
"storage_layer": {
"chunks": {
"total_chunks": 83,
"mem_chunks": 0,
"fs_chunks": 83,
"fs_chunks_up": 83,
"fs_chunks_down": 0
}
},
"input_chunks": {
"tail.0": {
"status": {
"overlimit": false,
"mem_size": "163.0M",
"mem_limit": "0b"
},
"chunks": {
"total": 83,
"up": 83,
"down": 0,
"busy": 0,
"busy_size": "0b"
}
},
"storage_backlog.1": {
"status": {
"overlimit": false,
"mem_size": "0b",
"mem_limit": "0b"
},
"chunks": {
"total": 0,
"up": 0,
"down": 0,
"busy": 0,
"busy_size": "0b"
}
}
}
}
I find the metrics in the api/v1/storage JSON output unhelpful: I'd have to write some custom parser/converter to turn the human-readable size/limit output into proper integer values. Native Prometheus metrics that expose these values with proper integer types would be really helpful for monitoring.
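To make that concrete, a hedged sketch of such a converter is shown below: it scrapes /api/v1/storage (the same endpoint as the curl output above) and turns the human-readable sizes into plain integers in Prometheus exposition format. The metric names and the helper itself are illustrative assumptions, not part of Fluent Bit.

```python
# Illustrative converter: poll Fluent Bit's /api/v1/storage JSON and emit
# Prometheus-style lines with integer byte values. Metric names are invented
# for this example and are not provided by Fluent Bit itself.
import json
import re
import urllib.request

UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def to_bytes(value: str) -> int:
    """Convert strings like '163.0M' or '0b' into an integer byte count."""
    match = re.fullmatch(r"([\d.]+)\s*([bkmg])b?", value.strip(), re.IGNORECASE)
    if not match:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * UNITS[unit.lower()])


def scrape(base_url: str = "http://127.0.0.1:2020") -> list:
    """Return Prometheus exposition lines built from the storage API output."""
    with urllib.request.urlopen(f"{base_url}/api/v1/storage") as resp:
        data = json.load(resp)

    lines = []
    for name, info in data.get("input_chunks", {}).items():
        status, chunks = info["status"], info["chunks"]
        lines.append(
            f'fluentbit_input_storage_mem_bytes{{name="{name}"}} '
            f'{to_bytes(status["mem_size"])}'
        )
        lines.append(
            f'fluentbit_input_storage_chunks_total{{name="{name}"}} {chunks["total"]}'
        )
        lines.append(
            f'fluentbit_input_storage_chunks_down{{name="{name}"}} {chunks["down"]}'
        )
    return lines


if __name__ == "__main__":
    print("\n".join(scrape()))
```

Having Fluent Bit expose these values as native Prometheus gauges would make this kind of glue code unnecessary.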
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
An additional note: the single-threaded event model of the pipeline can also become a source of contention. Depending on the load and configuration, the pipeline itself can become a bottleneck. The tail plugin has a way to ignore old files, but there is no way to tell whether we are buffering unprocessed files on disk because the pipeline is at full capacity.
Yes, we were aware of this endpoint, but it is not practical for collecting metrics. That is why the request explicitly asks to expose them on the Prometheus endpoint.
FYI:
we are working on a v2 api endpoint for metrics that will have all metrics together
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
Work in progress, we have moved the metrics to use the new c-metrics / prometheus base
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
Work in progress, we have moved the metrics to use the new c-metrics / prometheus base
Hey! Is there a roadmap or relevant PR planned for this feature so that we can track it there? Additionally, I'm curious: @edsiper mentioned that there will be a v2 API endpoint for the Prometheus metrics. Does that mean we can expect something like /api/v2/metrics/prometheus eventually (which would include all monitoring endpoints)?
Kind ping 🙏 @agup006
Looking forward to this feature! Is there anything I can help with? @edsiper Not making any promises, but I'd be interested in looking into it if you point me in the right direction. 🙏
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
@edsiper Can you please re-open the issue and mark as unstale? Thanks!
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
Can someone please add the exempt-stale label?
FYI:
we are working on a v2 api endpoint for metrics that will have all metrics together
Kind ping @edsiper 🤞 Hit this issue again, any updates on the v2 api endpoint? 👀
I'm also interested in this!
Let's add a new issue for a v2 API endpoint that includes everything. As a workaround, you can use the fluentbit_metrics plugin and then a custom Prometheus exporter plugin to do this today under a single endpoint. @alanprot @Dentrax
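For anyone who wants to try that today, here is a minimal sketch of the workaround, assuming the fluentbit_metrics input and prometheus_exporter output plugins available in recent Fluent Bit releases; the tag, scrape interval, and port 2021 are arbitrary choices, so check the plugin docs for your version:

```
[SERVICE]
    flush 1

[INPUT]
    name            fluentbit_metrics
    tag             internal_metrics
    scrape_interval 2

[OUTPUT]
    name  prometheus_exporter
    match internal_metrics
    host  0.0.0.0
    port  2021
```

With this in place, Prometheus can scrape whatever the fluentbit_metrics input collects from a single endpoint on port 2021.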