
Expose storage metrics and dropped chunks in prometheus endpoint

Open cristiamu opened this issue 3 years ago • 23 comments

Is your feature request related to a problem? We've been trying to configure fluent-bit in a system with high throughput and high cardinality of records. In our scenario, data consistency is not relevant. We prefer to sample records and fail fast, in favor of having a healthy pipeline with constant throughput.

We've tried several buffer configurations, but under heavy load we consistently get the same results. With retry mechanisms enabled, fluent-bit accumulates data over time until the buffer becomes full. At that point, the throughput (RPS) drops as the pipeline starts dropping chunks while adding new ones.

The metrics are only exposed in a JSON endpoint, making them hard to collect and aggregate. Even though I could see error logs of chunks being dropped, I couldn't see any metrics on the output storage.

Describe the solution you'd like

  • Related issue to disable retries
  • Add storage metrics to Prometheus endpoint
  • Add metrics for chunks being dropped.

Additional context

  • Without storage/chunk metrics, it is hard to evaluate the health of the pipeline until the issue becomes symptomatic once the buffer is full.

cristiamu avatar Mar 12 '21 05:03 cristiamu

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Apr 12 '21 02:04 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Apr 18 '21 02:04 github-actions[bot]

Re-opening; related to other additional metrics that are required.

agup006 avatar Apr 18 '21 02:04 agup006

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar May 20 '21 01:05 github-actions[bot]

FYI:

curl -s http://127.0.0.1:2020/api/v1/storage | jq
{
  "storage_layer": {
    "chunks": {
      "total_chunks": 83,
      "mem_chunks": 0,
      "fs_chunks": 83,
      "fs_chunks_up": 83,
      "fs_chunks_down": 0
    }
  },
  "input_chunks": {
    "tail.0": {
      "status": {
        "overlimit": false,
        "mem_size": "163.0M",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 83,
        "up": 83,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "storage_backlog.1": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    }
  }
}

edsiper avatar May 20 '21 02:05 edsiper

I find the metrics in the api/v1/storage JSON output unhelpful: I'd have to write a custom parser/converter to change the human-readable size/limit output into proper integer values. Native Prometheus metrics that expose those values as proper integer types would be really helpful for monitoring.

tarrychk avatar May 31 '21 11:05 tarrychk
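
For reference, here is a rough sketch of the kind of converter implied above, built on the same curl/jq pattern shown earlier in the thread. The unit suffixes handled here (b/K/M/G) are an assumption about Fluent Bit's human-readable sizes, and unit_factor/to_bytes are just illustrative helper names:

curl -s http://127.0.0.1:2020/api/v1/storage | jq '
  # assumed unit suffixes; adjust to whatever Fluent Bit actually emits
  def unit_factor:
    if   . == "K" then 1024
    elif . == "M" then 1048576
    elif . == "G" then 1073741824
    else 1 end;
  def to_bytes:
    capture("^(?<n>[0-9.]+)(?<u>[A-Za-z]?)$") | (.n | tonumber) * (.u | unit_factor);
  # emit one object per input instance with integer byte counts
  .input_chunks | to_entries[] | {
    instance:        .key,
    mem_size_bytes:  (.value.status.mem_size  | to_bytes),
    mem_limit_bytes: (.value.status.mem_limit | to_bytes),
    chunks_total:    .value.chunks.total,
    chunks_up:       .value.chunks.up
  }'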

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 01 '21 01:07 github-actions[bot]

An additional note: the single-threaded event model of the pipeline can also become a source of contention. Depending on the load and configuration, the pipeline itself can become a bottleneck. The tail plugin has a way to ignore old files, but there is no way to tell whether we are buffering unprocessed files on disk because the pipeline is at full capacity.

cristiamu avatar Jul 20 '21 17:07 cristiamu
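
For context, a minimal sketch of the kind of setup that comment refers to (the paths below are made-up placeholders): tail can skip old files with ignore_older and buffer to disk with storage.type filesystem, and storage.metrics enables the /api/v1/storage endpoint shown above, but none of this reveals whether the pipeline itself is the bottleneck.

[SERVICE]
    http_server      on
    http_port        2020
    # placeholder buffer path
    storage.path     /var/lib/fluent-bit/buffer
    storage.metrics  on

[INPUT]
    name             tail
    # placeholder log path
    path             /var/log/app/*.log
    ignore_older     6h
    storage.type     filesystem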

FYI:

curl -s http://127.0.0.1:2020/api/v1/storage | jq
{
  "storage_layer": {
    "chunks": {
      "total_chunks": 83,
      "mem_chunks": 0,
      "fs_chunks": 83,
      "fs_chunks_up": 83,
      "fs_chunks_down": 0
    }
  },
  "input_chunks": {
    "tail.0": {
      "status": {
        "overlimit": false,
        "mem_size": "163.0M",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 83,
        "up": 83,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    },
    "storage_backlog.1": {
      "status": {
        "overlimit": false,
        "mem_size": "0b",
        "mem_limit": "0b"
      },
      "chunks": {
        "total": 0,
        "up": 0,
        "down": 0,
        "busy": 0,
        "busy_size": "0b"
      }
    }
  }
}

Yes, we were aware of this endpoint, but it is not practical for collecting metrics. That is why the request is explicit about exposing them on the Prometheus endpoint.

cristiamu avatar Aug 03 '21 23:08 cristiamu

FYI:

we are working on a v2 api endpoint for metrics that will have all metrics together

edsiper avatar Aug 03 '21 23:08 edsiper

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Sep 03 '21 01:09 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Sep 09 '21 01:09 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Sep 15 '21 01:09 github-actions[bot]

Work in progress, we have moved the metrics to use the new c-metrics / prometheus base

agup006 avatar Sep 15 '21 03:09 agup006

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Oct 16 '21 01:10 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Oct 22 '21 01:10 github-actions[bot]

Work in progress, we have moved the metrics to use the new c-metrics / prometheus base

Hey! Is there a roadmap or a relevant PR planned for this feature so that we can track it there? Additionally, I am curious: @edsiper mentioned that there will be a v2 API endpoint for the Prometheus metrics. Does that mean we can expect something like /api/v2/metrics/prometheus eventually (which will include all monitoring endpoints)?

Dentrax avatar Oct 31 '21 22:10 Dentrax
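
If the endpoint does land at the path suggested in the question above (an assumption, not something confirmed in this thread), scraping it would look just like the v1 calls shown earlier:

curl -s http://127.0.0.1:2020/api/v2/metrics/prometheus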

Kind ping 🙏 @agup006

Dentrax avatar Nov 17 '21 10:11 Dentrax

Looking forward to this feature! Is there anything I can help with? @edsiper Not making any promises, but I'd be interested in looking into it if you point me in the right direction. 🙏

Dentrax avatar Jan 07 '22 08:01 Dentrax

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Apr 08 '22 02:04 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Apr 13 '22 02:04 github-actions[bot]

@edsiper Can you please re-open the issue and mark as unstale? Thanks!

Dentrax avatar May 10 '22 07:05 Dentrax

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Aug 09 '22 02:08 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Aug 15 '22 02:08 github-actions[bot]

Can someone please add exempt-stale label?

Dentrax avatar Aug 29 '22 09:08 Dentrax

FYI:

we are working on a v2 api endpoint for metrics that will have all metrics together

Kind ping @edsiper 🤞 Hit this issue again, any updates on v2 api endpoint? 👀

Dentrax avatar Sep 28 '22 12:09 Dentrax

I'm also interested in this!

alanprot avatar Nov 29 '22 18:11 alanprot

Let's add a new issue for a v2 API endpoint that includes everything. As a workaround, you can use the fluentbit_metrics plugin and then a prometheus exporter plugin to do this today under a single endpoint @alanprot @Dentrax

agup006 avatar Nov 29 '22 19:11 agup006
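
A minimal sketch of that workaround, assuming a Fluent Bit build (1.8+) that ships the fluentbit_metrics input and the prometheus_exporter output; the tag and port below are arbitrary placeholder choices:

[INPUT]
    # scrape Fluent Bit's own internal metrics every 2 seconds
    name            fluentbit_metrics
    tag             internal_metrics
    scrape_interval 2

[OUTPUT]
    # re-expose the collected metrics in Prometheus text format
    name            prometheus_exporter
    match           internal_metrics
    host            0.0.0.0
    port            2021

Prometheus can then scrape http://<host>:2021/metrics for whatever internal metrics the fluentbit_metrics input currently emits; whether that covers the storage/dropped-chunk metrics requested in this issue depends on the Fluent Bit version.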