fluentbit_ metrics stop being sent to prometheus_remote_write output about 1 hour after start

Open g-cos opened this issue 1 month ago • 9 comments

Bug Report

Approximately one hour after fluentbit starts, all fluentbit_ internal metrics stop being included in what the prometheus_remote_write output sends. These are all of the metrics produced by the fluentbit_metrics input. This continues indefinitely until fluentbit is restarted; the existing process never starts writing these metrics again.

Metrics from any other inputs that produce metrics, such as prometheus_scrape and prometheus_textfile, continue to be sent normally. Also, if a prometheus_exporter output is configured, the fluentbit_metrics metrics are still exported there.
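
That prometheus_exporter check was just an extra output added alongside the remote write output, roughly like this (the host and port values here are only examples):

    - name: prometheus_exporter
      match: 'metrics_*'
      host: 127.0.0.1
      port: 2021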

To Reproduce

I can reproduce this with a minimal configuration, running on my local MacBook. After starting up Victoria Metrics listening on localhost:8428, I run fluent-bit with this config:

---
service:
  flush: 1
  daemon: Off
  log_level: debug
  # Enable/Disable the built-in HTTP Server for metrics
  http_server: Off
  http_listen: 127.0.0.1
  http_port: 2020

pipeline:
  inputs:
    - name: fluentbit_metrics
      tag: metrics_fluentbit
      scrape_interval: 60s

  outputs:
    - name: prometheus_remote_write
      match: 'metrics_*'
      host: localhost
      port: 8428
      uri: /api/v1/write
      retry_limit: 2
      log_response_payload: True
      tls: Off
      add_label: job fluentbit2
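
Victoria Metrics here is just a single-node instance running with default settings (8428 is its default listen port), and fluent-bit is simply given the file above via -c. Roughly, with the binary and file names below being arbitrary:

victoria-metrics              # single-node build, default flags, listens on :8428
fluent-bit -c repro.yaml      # repro.yaml is the YAML config shown above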

Metrics such as fluentbit_output_upstream_total_connections and fluentbit_build_info begin appearing immediately, but stop after approximately one hour. After that point, fluentbit continues to log that it is sending Prometheus remote writes, and continues to log HTTP status=204 and FLB_OK, but those metrics never reappear.
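
One simple way to watch the series disappear is an instant query against Victoria Metrics' Prometheus-compatible query API, for example (metric name chosen arbitrarily):

curl 'http://localhost:8428/api/v1/query?query=fluentbit_build_info'

Before the cutoff this returns the series with a recent sample; once the fluentbit_ metrics stop arriving, the result eventually comes back empty as the series falls outside the query lookback window.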

If I add an additional input with any other metrics, those metrics continue to be sent. For example, I created a file /tmp/node_info.prom with a single static metric, and added this input to the config:

    - name: prometheus_textfile
      tag: metrics_textfile
      path: /tmp/node_info.prom
      scrape_interval: 60s
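
The file itself contains just a single static metric in the Prometheus text exposition format, something along these lines (the metric name and label here are only illustrative, not my exact file):

# HELP node_info Static test metric for the prometheus_textfile input
# TYPE node_info gauge
node_info{source="manual_test"} 1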

After the fluentbit_ metrics ceased, this one additional metric continued to be sent for as long as the fluentbit process ran, which was more than a day in a couple of my tests.

Your Environment

  • Version used: 4.0.3, 4.0.8, 4.1.1 (I reproduced with the same minimal config in all three of these versions)
  • Configuration: See above
  • Server type and version: MacBook Pro and AWS EC2
  • Operating System and version: macOS Sequoia 15.7.1 and Amazon Linux 2023

Additional context

We started observing this issue earlier this month. We're using fluentbit metrics to monitor fluentbit itself and potentially alert on problems, but we can no longer do so because these metrics are not being sent consistently.

g-cos avatar Oct 29 '25 13:10 g-cos

@cosmo0920 would you be able to take a look?

eschabell avatar Oct 29 '25 15:10 eschabell

Potentially related or maybe just a red herring: I noticed that when fluentbit is sending the fluentbit_ metrics, Victoria Metrics logs this message each time a remote write from fluentbit happens:

2025-10-29T12:34:28.806Z        warn    VictoriaMetrics/app/vmselect/promql/rollup_result_cache.go:77   resetting rollup result cache because the metric fluentbit_logger_logs_total{job="fluentbit2",message_type="error"} (Timestamp=1761737727519, Value=0.000000) has a timestamp older than -search.cacheTimestampOffset=5m0s by 3240.481s

This doesn't happen on the first few remote writes after I start fluentbit; it starts on the fifth one (based on following the fluentbit debug logs). The number at the end ("by 3240.481s") starts small, under 5 seconds the first time it's logged, and eventually builds up to around 3240 seconds (just under the one-hour mark) on the last log message; once the fluentbit_ metrics are no longer being sent, Victoria Metrics no longer logs any warnings when remote writes happen.

g-cos avatar Oct 29 '25 16:10 g-cos

@cosmo0920 Could you clarify what that PR does? The PR description talks about "log related metrics" and says it is a new feature. The metrics we're losing are not all related to logs; they are all fluentbit_ metrics, including things like fluentbit_build_info. Also, before 4.x these metrics were consistently emitted for the life of the process, so this isn't a new feature, it's a regression. Can you help me understand whether your PR addresses this regression?

g-cos avatar Oct 30 '25 15:10 g-cos

@cosmo0920 Could you clarify what that PR does? The PR description talks about "log related metrics" and says it is a new feature. The metrics we're losing are not all related to logs; they are all fluentbit_ metrics, including things like fluentbit_build_info.

2025-10-29T12:34:28.806Z        warn    VictoriaMetrics/app/vmselect/promql/rollup_result_cache.go:77   resetting rollup result cache because the metric fluentbit_logger_logs_total{job="fluentbit2",message_type="error"} (Timestamp=1761737727519, Value=0.000000) has a timestamp older than -search.cacheTimestampOffset=5m0s by 3240.481s

This log is clearly related to fluentbit_logger_logs_total, which suggests that metric could be stuck because its log type is inactive, especially for error-level logs, since error-level logs shouldn't be raised frequently. So we need to warm it up over several intervals.

cosmo0920 avatar Oct 31 '25 08:10 cosmo0920

Ah, got it. The fluentbit metrics should be persistent as-is when created at startup. These types of metrics are not updated with current timestamps, so the stuck metrics should vanish after 3600 seconds because of the metrics cutoff.

cosmo0920 avatar Oct 31 '25 08:10 cosmo0920

Ah, got it. The fluentbit metrics should be persistent as-is when created at startup. These types of metrics are not updated with current timestamps, so the stuck metrics should vanish after 3600 seconds because of the metrics cutoff.

What about metrics like fluentbit_output_upstream_total_connections, the other example I gave? What does it mean for it to be "persistent as-is when created at startup"? Its value reflects the current state, not the state at startup, so it could be changing frequently. There are many fluentbit_ metrics related to the operation of fluentbit (records dropped, bytes filtered, etc.) that are not necessarily log-related; this is just one example.

g-cos avatar Oct 31 '25 15:10 g-cos

Ah, got it. The fluentbit metrics should be persistent as-is when created at startup. These types of metrics are not updated with current timestamps, so the stuck metrics should vanish after 3600 seconds because of the metrics cutoff.

What about metrics like fluentbit_output_upstream_total_connections, the other example I gave? What does it mean for it to be "persistent as-is when created at startup"?

Each fluentbit_xxx metric differs in how often its value is updated. fluentbit_build_info is an example of a mostly inactive, one-shot metric. These types of metrics have no way to update their timestamps on every cycle of sending metric payloads over the Prometheus remote write protocol.

cosmo0920 avatar Nov 02 '25 12:11 cosmo0920

I also had a chance to test 4.0.10 this week for a different reason, and confirmed that it has this bug as well.

g-cos avatar Nov 11 '25 16:11 g-cos

We were testing fluentbit 4.2.0, so I tried the repro config for this bug and was able to reproduce it locally with this version as well. This time I charted a number of other fluentbit_ metrics to see whether they all drop at the same time. I found two exceptions: fluentbit_output_proc_bytes_total and fluentbit_output_proc_records_total both continue to be emitted by the fluentbit process for as long as it runs. I wasn't able to find any others, though.

Examples of metrics that fluentbit stops emitting about 1 hour after the process starts:

  • fluentbit_process_start_time_seconds
  • fluentbit_output_upstream_total_connections
  • fluentbit_output_latency_seconds_count
  • fluentbit_input_ring_buffer_writes_total
  • fluentbit_build_info

Since I can reproduce this in a fluentbit config that neither reads nor sends any logs, this doesn't seem to be related to logs or log metrics.

g-cos avatar Nov 29 '25 23:11 g-cos