fluentbit_ metrics stop being sent to prometheus_remote_write output about 1 hour after start
Bug Report
Approximately one hour after fluentbit starts, all fluentbit_ internal metrics stop being written by the prometheus_remote_write output. These are all of the metrics produced by the fluentbit_metrics input. This continues indefinitely until fluentbit is restarted; the existing process never resumes writing these metrics.
Metrics from any other inputs that produce metrics, such as prometheus_scrape and prometheus_textfile, continue to be sent normally. Also, if a prometheus_exporter output is configured, the fluentbit_metrics metrics are still exported there.
To Reproduce
I can reproduce this with a minimal configuration, running on my local MacBook. After starting up Victoria Metrics listening on localhost:8428, I run fluent-bit with this config:
```yaml
---
service:
  flush: 1
  daemon: Off
  log_level: debug
  # Enable/Disable the built-in HTTP Server for metrics
  http_server: Off
  http_listen: 127.0.0.1
  http_port: 2020

pipeline:
  inputs:
    - name: fluentbit_metrics
      tag: metrics_fluentbit
      scrape_interval: 60s

  outputs:
    - name: prometheus_remote_write
      match: 'metrics_*'
      host: localhost
      port: 8428
      uri: /api/v1/write
      retry_limit: 2
      log_response_payload: True
      tls: Off
      add_label: job fluentbit2
```
Metrics such as fluentbit_output_upstream_total_connections and fluentbit_build_info begin appearing immediately, but cease after approximately one hour. After that point, fluentbit continues to log that it is sending prometheus remote writes, and continues to log HTTP status=204 and FLB_OK, but those metrics never reappear.
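For what it's worth, the 204s and FLB_OK on their own don't show whether the fluentbit_ series are still inside the payload. A throwaway sink like the following sketch (it only logs request sizes and always replies 204, the same status Victoria Metrics returns above) could be pointed at instead of Victoria Metrics to see whether the payload size shrinks around the one-hour mark; I haven't verified the outcome, it's just a way to check:

```python
# throwaway_sink.py -- a rough stand-in for Victoria Metrics (point the
# prometheus_remote_write output's host/port at 127.0.0.1:8428 as in the config above).
# It does not decode the snappy/protobuf payload; it only logs each request's size
# and answers 204 so Fluent Bit keeps treating the write as successful.
from datetime import datetime
from http.server import BaseHTTPRequestHandler, HTTPServer


class Sink(BaseHTTPRequestHandler):
    def do_POST(self):
        size = int(self.headers.get("Content-Length", 0))
        self.rfile.read(size)  # drain the body without decoding it
        print(f"{datetime.now().isoformat()} POST {self.path} {size} bytes")
        self.send_response(204)
        self.end_headers()

    def log_message(self, fmt, *args):  # silence the default per-request access log
        pass


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8428), Sink).serve_forever()
```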
If I add an additional input with any other metrics, those metrics continue to be sent. For example, I created a file /tmp/node_info.prom with a single static metric, and added this input to the config:
```yaml
    - name: prometheus_textfile
      tag: metrics_textfile
      path: /tmp/node_info.prom
      scrape_interval: 60s
```
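For reference, the file just needs a single static gauge in Prometheus text exposition format; the exact metric doesn't matter. Something like this sketch would produce an equivalent file (the node_info name and its label are placeholders, not necessarily what I actually used):

```python
# make_node_info_prom.py -- writes a single static gauge in Prometheus text
# exposition format for the prometheus_textfile input. The metric name and label
# below are illustrative placeholders.
from pathlib import Path

EXPOSITION = (
    "# HELP node_info static marker metric used only for this test\n"
    "# TYPE node_info gauge\n"
    'node_info{source="textfile"} 1\n'
)

Path("/tmp/node_info.prom").write_text(EXPOSITION)
```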
After the fluentbit_ metrics ceased, this one additional metric continued to be sent for as long as the fluentbit process ran, which was more than a day in a couple of my tests.
Your Environment
- Version used: 4.0.3, 4.0.8, 4.1.1 (I reproduced with the same minimal config in all three of these versions)
- Configuration: See above
- Server type and version: MacBook Pro and AWS EC2
- Operating System and version: macOS Sequoia 15.7.1 and Amazon Linux 2023
Additional context
We started observing this issue earlier this month. We're using fluentbit metrics to monitor fluentbit and possibly alert on problems, but can no longer do so because these metrics are no longer being sent consistently.
@cosmo0920 would you be able to take a look?
Potentially related or maybe just a red herring: I noticed that when fluentbit is sending the fluentbit_ metrics, Victoria Metrics logs this message each time a remote write from fluentbit happens:
2025-10-29T12:34:28.806Z warn VictoriaMetrics/app/vmselect/promql/rollup_result_cache.go:77 resetting rollup result cache because the metric fluentbit_logger_logs_total{job="fluentbit2",message_type="error"} (Timestamp=1761737727519, Value=0.000000) has a timestamp older than -search.cacheTimestampOffset=5m0s by 3240.481s
This doesn't happen on the first few remote writes after I start fluentbit; it starts on the fifth one (based on following the fluentbit debug logs). The number at the end ("by 3240.481s") starts small, under 5 seconds the first time it's logged, and eventually builds up to around 3240 seconds on the last log message; once the fluentbit_ metrics are no longer being sent, Victoria Metrics no longer logs any warnings when remote writes happen.
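For concreteness, the two timestamps in that warning line are about 59 minutes apart, which is the 5m cacheTimestampOffset plus the ~3240s that VM reports. A quick check using only the values from the log line above:

```python
# decode_vm_warning.py -- relate the two timestamps in the Victoria Metrics warning
# quoted above: the sample timestamp (milliseconds) and the warning's own wall-clock time.
from datetime import datetime, timezone

sample_ms = 1761737727519  # "Timestamp=" value from the warning
logged_at = datetime(2025, 10, 29, 12, 34, 28, 806000, tzinfo=timezone.utc)

sample_at = datetime.fromtimestamp(sample_ms / 1000, tz=timezone.utc)
lag_seconds = (logged_at - sample_at).total_seconds()

print(sample_at.isoformat())   # 2025-10-29T11:35:27.519000+00:00
print(round(lag_seconds, 1))   # ~3541 s, i.e. roughly the 5m offset plus the 3240.481 s VM reports
```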
@cosmo0920 Could you clarify what that PR does? The PR description talks about "log related metrics", and says it is a new feature. The metrics we're losing are not all related to logs; they are all fluentbit_ metrics, including things like fluentbit_build_info. Also, before 4.x, these metrics were consistently emitted for the life of the process, so this isn't a new feature, it's a regression. Can you help me understand whether your PR addresses this regression?
> @cosmo0920 Could you clarify what that PR does? The PR description talks about "log related metrics", and says it is a new feature. The metrics we're losing are not all related to logs; they are all fluentbit_ metrics, including things like fluentbit_build_info.

> 2025-10-29T12:34:28.806Z warn VictoriaMetrics/app/vmselect/promql/rollup_result_cache.go:77 resetting rollup result cache because the metric fluentbit_logger_logs_total{job="fluentbit2",message_type="error"} (Timestamp=1761737727519, Value=0.000000) has a timestamp older than -search.cacheTimestampOffset=5m0s by 3240.481s
This log is clearly related to fluentbit_logger_logs_total, which means the log metrics could be stuck due to an inactive type of log, especially error-level logs. This is because error-level logs shouldn't be raised frequently, so we need to warm up over several intervals.
Ah, got it. The fluentbit_ metrics should stay persistent as-is when created at startup. These types of metrics are not updated with current timestamps, so the stuck metrics should vanish after 3600 seconds because of the metrics cutoff.
> Ah, got it. The fluentbit_ metrics should stay persistent as-is when created at startup. These types of metrics are not updated with current timestamps, so the stuck metrics should vanish after 3600 seconds because of the metrics cutoff.
What about metrics like fluentbit_output_upstream_total_connections, the other example I gave? What does it mean for it to be "persistent as-is when created at startup"? Its value reflects the current state, not the state at startup, so it could be changing frequently. There are many fluentbit_ metrics related to the operation of fluentbit (records dropped, bytes filtered, etc.) that are not necessarily log-related; this is just one example.
> Ah, got it. The fluentbit_ metrics should stay persistent as-is when created at startup. These types of metrics are not updated with current timestamps, so the stuck metrics should vanish after 3600 seconds because of the metrics cutoff.

> What about metrics like fluentbit_output_upstream_total_connections, the other example I gave? What does it mean for it to be "persistent as-is when created at startup"?
Each of the fluentbit_xxx metrics differs in how often its value is updated. fluentbit_build_info is one example of a mostly inactive, one-shot metric. So these types of metrics do not have a way to be updated on every cycle of sending metric payloads over the prometheus remote write protocol.
I also had a chance to test 4.0.10 this week for a different reason, and confirmed that it also has this bug.
We were testing fluentbit 4.2.0, so I tried the repro config for this bug and was able to reproduce it locally with this version as well. This time I charted a number of other fluentbit_ metrics to see if they all drop at the same time. I was able to find two exceptions: both fluentbit_output_proc_bytes_total and fluentbit_output_proc_records_total continue to be emitted by the fluentbit process for as long as it runs. I wasn't able to find any others, though.
Examples of metrics that fluentbit stops emitting about 1 hour after the process starts:
- fluentbit_process_start_time_seconds
- fluentbit_output_upstream_total_connections
- fluentbit_output_latency_seconds_count
- fluentbit_input_ring_buffer_writes_total
- fluentbit_build_info
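The check itself can be as simple as an instant query per metric against Victoria Metrics, since a series whose last sample has fallen out of the lookback window stops returning results. A rough sketch of that kind of check (the URL and metric list are just from my local repro; the requests library is assumed):

```python
# check_fluentbit_series.py -- instant-query a handful of fluentbit_ series against the
# local Victoria Metrics from the repro config; a series whose last sample is older than
# the lookback window returns no result. Assumes `pip install requests`.
import requests

VM_QUERY_URL = "http://localhost:8428/api/v1/query"
METRICS = [
    "fluentbit_build_info",
    "fluentbit_process_start_time_seconds",
    "fluentbit_output_upstream_total_connections",
    "fluentbit_input_ring_buffer_writes_total",
    "fluentbit_output_proc_bytes_total",    # one of the two that kept flowing
    "fluentbit_output_proc_records_total",  # the other exception
]

for metric in METRICS:
    resp = requests.get(VM_QUERY_URL, params={"query": metric}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    print(f"{metric}: {'still returning samples' if result else 'gone stale'}")
```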
Since I can reproduce this in a fluentbit config that neither reads nor sends any logs, this doesn't seem to be related to logs or log metrics.