fluent-bit
Expose storage metrics and dropped chunks in prometheus endpoint
Is your feature request related to a problem? We've been trying to configure fluent-bit in a system with high throughput and high record cardinality. In our scenario data consistency is not critical: we prefer to sample records and fail fast, in favor of keeping a healthy pipeline with constant throughput.
We've tried several buffer configurations, but under heavy load we consistently get the same results: with the retry mechanism, fluent-bit accumulates data over time until the buffer becomes full. At that point the throughput (rps) drops as the pipeline starts dropping chunks while adding new ones.
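For illustration only, here is a minimal sketch of the kind of filesystem-buffered, limited-retry setup described above; the paths, limits, retry count, and the es output are placeholder assumptions rather than a recommendation:

```
[SERVICE]
    flush                     1
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.backlog.mem_limit 64M

[INPUT]
    name          tail
    path          /var/log/app/*.log
    storage.type  filesystem
    mem_buf_limit 128M

[OUTPUT]
    name                     es
    match                    *
    host                     127.0.0.1
    port                     9200
    retry_limit              2
    storage.total_limit_size 512M
```

Even with limits like these in place, the only visibility into how full the buffer is comes from the JSON API discussed below.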
The metrics are only exposed on a JSON endpoint, which makes them hard to collect and aggregate. Even though I could see error logs about chunks being dropped, I couldn't see any metrics for the output storage.
Describe the solution you'd like
- Related issue to disable retries
- Add storage metrics to Prometheus endpoint
- Add metrics for chunks being dropped.
Additional context
- Without storage/chunk metrics, it is hard to evaluate the health of the pipeline until the problem becomes symptomatic, i.e. when the buffer is full.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
Re-opening; related to other additional metrics that are required.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
FYI:
curl -s http://127.0.0.1:2020/api/v1/storage | jq
{
"storage_layer": {
"chunks": {
"total_chunks": 83,
"mem_chunks": 0,
"fs_chunks": 83,
"fs_chunks_up": 83,
"fs_chunks_down": 0
}
},
"input_chunks": {
"tail.0": {
"status": {
"overlimit": false,
"mem_size": "163.0M",
"mem_limit": "0b"
},
"chunks": {
"total": 83,
"up": 83,
"down": 0,
"busy": 0,
"busy_size": "0b"
}
},
"storage_backlog.1": {
"status": {
"overlimit": false,
"mem_size": "0b",
"mem_limit": "0b"
},
"chunks": {
"total": 0,
"up": 0,
"down": 0,
"busy": 0,
"busy_size": "0b"
}
}
}
}
I find the metrics in the api/v1/storage JSON output unhelpful: I'd have to write some custom parser/converter to turn the human-readable size/limit output into proper integer values. Native Prometheus metrics that expose these values with proper integer types would be really helpful for monitoring.
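To make that concrete, a hedged sketch of such a converter is shown below: it scrapes /api/v1/storage (the same endpoint as the curl output above) and turns the human-readable sizes into plain integers in Prometheus exposition format. The metric names and the helper itself are illustrative assumptions, not part of Fluent Bit.

```python
# Illustrative converter: poll Fluent Bit's /api/v1/storage JSON and emit
# Prometheus-style lines with integer byte values. Metric names are invented
# for this example and are not provided by Fluent Bit itself.
import json
import re
import urllib.request

UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def to_bytes(value: str) -> int:
    """Convert strings like '163.0M' or '0b' into an integer byte count."""
    match = re.fullmatch(r"([\d.]+)\s*([bkmg])b?", value.strip(), re.IGNORECASE)
    if not match:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(float(number) * UNITS[unit.lower()])


def scrape(base_url: str = "http://127.0.0.1:2020") -> list:
    """Return Prometheus exposition lines built from the storage API output."""
    with urllib.request.urlopen(f"{base_url}/api/v1/storage") as resp:
        data = json.load(resp)

    lines = []
    for name, info in data.get("input_chunks", {}).items():
        status, chunks = info["status"], info["chunks"]
        lines.append(
            f'fluentbit_input_storage_mem_bytes{{name="{name}"}} '
            f'{to_bytes(status["mem_size"])}'
        )
        lines.append(
            f'fluentbit_input_storage_chunks_total{{name="{name}"}} {chunks["total"]}'
        )
        lines.append(
            f'fluentbit_input_storage_chunks_down{{name="{name}"}} {chunks["down"]}'
        )
    return lines


if __name__ == "__main__":
    print("\n".join(scrape()))
```

Having Fluent Bit expose these values as native Prometheus gauges would make this kind of glue code unnecessary.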
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
An additional note: the single-threaded event model of the pipeline can also become a source of contention. Depending on the load and configuration, the pipeline itself can become a bottleneck. The tail plugin has a way to ignore old files, but there is no way to tell whether we are buffering unprocessed files on disk because the pipeline is at full capacity.
Yes, we were aware of this endpoint, but it is not practical for collecting metrics. That is why the request explicitly asks to expose them on the Prometheus endpoint.
FYI:
we are working on a v2 api endpoint for metrics that will have all metrics together
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
Work in progress, we have moved the metrics to use the new c-metrics / prometheus base
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
Work in progress, we have moved the metrics to use the new c-metrics / prometheus base
Hey! Is there a roadmap or relevant PR planned for this feature so that we can track it there? Additionally, I'm curious: @edsiper mentioned that there will be a v2 API endpoint for the Prometheus metrics. Does that mean we can expect something like /api/v2/metrics/prometheus eventually (which would include all monitoring endpoints)?
Kind ping 🙏 @agup006
Looking forward to this feature! Is there anything I can help with? @edsiper Not making any promises, but I'd be interested in looking into it if you point me in the right direction. 🙏
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
@edsiper Can you please re-open the issue and mark as unstale? Thanks!
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.
Can someone please add the exempt-stale label?
FYI:
we are working on a v2 api endpoint for metrics that will have all metrics together
Kind ping @edsiper 🤞 Hit this issue again, any updates on the v2 api endpoint? 👀
I'm also interested in this!
Let's add a new issue for a v2 API endpoint that includes everything. As a workaround, you can use the fluentbit_metrics plugin and then a custom Prometheus exporter plugin to do this today under a single endpoint. @alanprot @Dentrax
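For anyone who wants to try that today, here is a minimal sketch of the workaround, assuming the fluentbit_metrics input and prometheus_exporter output plugins available in recent Fluent Bit releases; the tag, scrape interval, and port 2021 are arbitrary choices, so check the plugin docs for your version:

```
[SERVICE]
    flush 1

[INPUT]
    name            fluentbit_metrics
    tag             internal_metrics
    scrape_interval 2

[OUTPUT]
    name  prometheus_exporter
    match internal_metrics
    host  0.0.0.0
    port  2021
```

With this in place, Prometheus can scrape whatever the fluentbit_metrics input collects from a single endpoint on port 2021.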