fluent-bit
Prometheus Exporter metrics with different tags should have only one HELP and TYPE comment line
Bug Report
The exported Prometheus metrics with the same name but different tags have duplicate HELP and TYPE comment lines. According to https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details, only one HELP/TYPE pair is allowed for any given metric.
To Reproduce
- Rubular link if applicable: NA
- Example log message if applicable: NA
- Steps to reproduce the problem: set up a Fluent Bit configuration with multiple outputs, as shown in the Configuration section below.
Expected behavior Only one TYPE and HELP line should be generated for each Fluent Bit output metric, but duplicates appear as in the screenshot.
Screenshots
As highlighted above, there are duplicate TYPE/HELP comment lines.
Your Environment
- Version used: 1.9.3
- Configuration:
[OUTPUT]
Name http
Alias confiant
Match bids
...
[OUTPUT]
Name s3
Alias s3
Match bids
region {{ .Values.s3RegionForBids }}
bucket {{ .Values.s3BucketForBids }}
...
[OUTPUT]
Name prometheus_exporter
Alias exporter
match internal_metrics
...
- Environment name and version (e.g. Kubernetes? What version?): K8S
- Server type and version: EKS
- Operating System and version: x86_64 Linux 5.4
- Filters and plugins: no filters, output plugins as in the configuration above.
Additional context It causes issues when we try to feed these metrics to our monitoring system, since according to https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details, only one HELP/TYPE pair is allowed for any given metric.
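To make the violation concrete, here is a small illustrative Python helper (not part of Fluent Bit; the function name is hypothetical) that flags metric names with more than one HELP or TYPE line in exposition text:

```python
def find_duplicate_headers(exposition_text):
    """Return metric names that have more than one HELP or TYPE line."""
    seen = set()        # (metric name, header kind) pairs already seen
    duplicates = set()
    for line in exposition_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "#" and parts[1] in ("HELP", "TYPE"):
            kind, name = parts[1], parts[2]
            if (name, kind) in seen:
                duplicates.add(name)
            seen.add((name, kind))
    return duplicates

# Output shaped like the buggy exporter: two banners for one metric name.
bad = """\
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="dummy.0"} 3146
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="dummy.1"} 3146
"""
print(find_duplicate_headers(bad))  # {'fluentbit_input_bytes_total'}
```

A strict parser such as the Prometheus Go parser rejects exactly this pattern.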
I cannot seem to reproduce this on 1.9.3 with this config as a test case:
[SERVICE]
Http_server On
[INPUT]
name dummy
tag dummy1
[INPUT]
name dummy
tag dummy2
[OUTPUT]
name stdout
match dummy1
[OUTPUT]
name stdout
match dummy2
[OUTPUT]
Name http
match nothing
Run the container and curl the endpoint:
$ docker run --rm -d -p 2020:2020 -v $PWD/fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf fluent/fluent-bit:1.9.3
$ curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="dummy.0"} 468 1652771931407
fluentbit_input_bytes_total{name="dummy.1"} 468 1652771931407
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="dummy.0"} 18 1652771931407
fluentbit_input_records_total{name="dummy.1"} 18 1652771931407
# HELP fluentbit_output_dropped_records_total Number of dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="http.2"} 0 1652771931407
fluentbit_output_dropped_records_total{name="stdout.0"} 0 1652771931407
fluentbit_output_dropped_records_total{name="stdout.1"} 0 1652771931407
# HELP fluentbit_output_errors_total Number of output errors.
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="http.2"} 0 1652771931407
fluentbit_output_errors_total{name="stdout.0"} 0 1652771931407
fluentbit_output_errors_total{name="stdout.1"} 0 1652771931407
# HELP fluentbit_output_proc_bytes_total Number of processed output bytes.
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="http.2"} 0 1652771931407
fluentbit_output_proc_bytes_total{name="stdout.0"} 416 1652771931407
fluentbit_output_proc_bytes_total{name="stdout.1"} 416 1652771931407
# HELP fluentbit_output_proc_records_total Number of processed output records.
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="http.2"} 0 1652771931407
fluentbit_output_proc_records_total{name="stdout.0"} 16 1652771931407
fluentbit_output_proc_records_total{name="stdout.1"} 16 1652771931407
# HELP fluentbit_output_retried_records_total Number of retried records.
# TYPE fluentbit_output_retried_records_total counter
fluentbit_output_retried_records_total{name="http.2"} 0 1652771931407
fluentbit_output_retried_records_total{name="stdout.0"} 0 1652771931407
fluentbit_output_retried_records_total{name="stdout.1"} 0 1652771931407
# HELP fluentbit_output_retries_failed_total Number of abandoned batches because the maximum number of re-tries was reached.
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="http.2"} 0 1652771931407
fluentbit_output_retries_failed_total{name="stdout.0"} 0 1652771931407
fluentbit_output_retries_failed_total{name="stdout.1"} 0 1652771931407
# HELP fluentbit_output_retries_total Number of output retries.
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="http.2"} 0 1652771931407
fluentbit_output_retries_total{name="stdout.0"} 0 1652771931407
fluentbit_output_retries_total{name="stdout.1"} 0 1652771931407
# HELP fluentbit_uptime Number of seconds that Fluent Bit has been running.
# TYPE fluentbit_uptime counter
fluentbit_uptime 18
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1652771913
# HELP fluentbit_build_info Build version information.
# TYPE fluentbit_build_info gauge
fluentbit_build_info{version="1.9.3",edition="Community"} 1
Ah, this seems to be an issue with the Prometheus Exporter itself: if we use it with the recent Fluent Bit metrics input plugin then it generates invalid output:
[SERVICE]
Http_server On
[INPUT]
name dummy
tag dummy1
[INPUT]
name dummy
tag dummy2
[OUTPUT]
name stdout
match dummy1
[OUTPUT]
name stdout
match dummy2
[OUTPUT]
Name http
match nothing
[INPUT]
name fluentbit_metrics
tag internal_metrics
[OUTPUT]
name prometheus_exporter
match internal_metrics
port 2021
Run it and check to see the incorrect output - make sure to also expose port 2021 this time:
$ docker run --rm -d -p 2020:2020 -p 2021:2021 -v $PWD/fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf fluent/fluent-bit:1.9.3
$ curl -s http://127.0.0.1:2021/metrics
# HELP fluentbit_uptime Number of seconds that Fluent Bit has been running.
# TYPE fluentbit_uptime counter
fluentbit_uptime{hostname="653a853d661c"} 121
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="dummy.0"} 3146
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="dummy.0"} 121
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="dummy.1"} 3146
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="dummy.1"} 121
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="fluentbit_metrics.2"} 520260
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="fluentbit_metrics.2"} 60
# HELP fluentbit_input_metrics_scrapes_total Number of total metrics scrapes
# TYPE fluentbit_input_metrics_scrapes_total counter
fluentbit_input_metrics_scrapes_total{name="fluentbit_metrics.2"} 61
# HELP fluentbit_output_proc_records_total Number of processed output records.
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="stdout.0"} 120
# HELP fluentbit_output_proc_bytes_total Number of processed output bytes.
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="stdout.0"} 3120
# HELP fluentbit_output_errors_total Number of output errors.
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="stdout.0"} 0
# HELP fluentbit_output_retries_total Number of output retries.
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="stdout.0"} 0
# HELP fluentbit_output_retries_failed_total Number of abandoned batches because the maximum number of re-tries was reached.
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="stdout.0"} 0
# HELP fluentbit_output_dropped_records_total Number of dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="stdout.0"} 0
# HELP fluentbit_output_retried_records_total Number of retried records.
# TYPE fluentbit_output_retried_records_total counter
fluentbit_output_retried_records_total{name="stdout.0"} 0
# HELP fluentbit_output_proc_records_total Number of processed output records.
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="stdout.1"} 120
# HELP fluentbit_output_proc_bytes_total Number of processed output bytes.
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="stdout.1"} 3120
# HELP fluentbit_output_errors_total Number of output errors.
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="stdout.1"} 0
# HELP fluentbit_output_retries_total Number of output retries.
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="stdout.1"} 0
# HELP fluentbit_output_retries_failed_total Number of abandoned batches because the maximum number of re-tries was reached.
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="stdout.1"} 0
# HELP fluentbit_output_dropped_records_total Number of dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="stdout.1"} 0
# HELP fluentbit_output_retried_records_total Number of retried records.
# TYPE fluentbit_output_retried_records_total counter
fluentbit_output_retried_records_total{name="stdout.1"} 0
# HELP fluentbit_output_proc_records_total Number of processed output records.
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="http.2"} 0
# HELP fluentbit_output_proc_bytes_total Number of processed output bytes.
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="http.2"} 0
# HELP fluentbit_output_errors_total Number of output errors.
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="http.2"} 0
# HELP fluentbit_output_retries_total Number of output retries.
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="http.2"} 0
# HELP fluentbit_output_retries_failed_total Number of abandoned batches because the maximum number of re-tries was reached.
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="http.2"} 0
# HELP fluentbit_output_dropped_records_total Number of dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="http.2"} 0
# HELP fluentbit_output_retried_records_total Number of retried records.
# TYPE fluentbit_output_retried_records_total counter
fluentbit_output_retried_records_total{name="http.2"} 0
# HELP fluentbit_output_proc_records_total Number of processed output records.
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="prometheus_exporter.3"} 60
# HELP fluentbit_output_proc_bytes_total Number of processed output bytes.
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="prometheus_exporter.3"} 520260
# HELP fluentbit_output_errors_total Number of output errors.
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="prometheus_exporter.3"} 0
# HELP fluentbit_output_retries_total Number of output retries.
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="prometheus_exporter.3"} 0
# HELP fluentbit_output_retries_failed_total Number of abandoned batches because the maximum number of re-tries was reached.
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="prometheus_exporter.3"} 0
# HELP fluentbit_output_dropped_records_total Number of dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="prometheus_exporter.3"} 0
# HELP fluentbit_output_retried_records_total Number of retried records.
# TYPE fluentbit_output_retried_records_total counter
fluentbit_output_retried_records_total{name="prometheus_exporter.3"} 0
# HELP fluentbit_process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE fluentbit_process_start_time_seconds gauge
fluentbit_process_start_time_seconds{hostname="653a853d661c"} 1652772686
# HELP fluentbit_build_info Build version information.
# TYPE fluentbit_build_info gauge
fluentbit_build_info{hostname="653a853d661c",version="1.9.3",os="linux"} 1652772686
The webserver output is fine:
$ curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="dummy.0"} 3926 1652772837034
fluentbit_input_bytes_total{name="dummy.1"} 3926 1652772837034
fluentbit_input_bytes_total{name="fluentbit_metrics.2"} 650325 1652772837034
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="dummy.0"} 151 1652772837034
fluentbit_input_records_total{name="dummy.1"} 151 1652772837034
fluentbit_input_records_total{name="fluentbit_metrics.2"} 75 1652772837034
# HELP fluentbit_output_dropped_records_total Number of dropped records.
# TYPE fluentbit_output_dropped_records_total counter
fluentbit_output_dropped_records_total{name="http.2"} 0 1652772837034
fluentbit_output_dropped_records_total{name="prometheus_exporter.3"} 0 1652772837034
fluentbit_output_dropped_records_total{name="stdout.0"} 0 1652772837034
fluentbit_output_dropped_records_total{name="stdout.1"} 0 1652772837034
# HELP fluentbit_output_errors_total Number of output errors.
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="http.2"} 0 1652772837034
fluentbit_output_errors_total{name="prometheus_exporter.3"} 0 1652772837034
fluentbit_output_errors_total{name="stdout.0"} 0 1652772837034
fluentbit_output_errors_total{name="stdout.1"} 0 1652772837034
# HELP fluentbit_output_proc_bytes_total Number of processed output bytes.
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="http.2"} 0 1652772837034
fluentbit_output_proc_bytes_total{name="prometheus_exporter.3"} 641654 1652772837034
fluentbit_output_proc_bytes_total{name="stdout.0"} 3874 1652772837034
fluentbit_output_proc_bytes_total{name="stdout.1"} 3874 1652772837034
# HELP fluentbit_output_proc_records_total Number of processed output records.
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="http.2"} 0 1652772837034
fluentbit_output_proc_records_total{name="prometheus_exporter.3"} 74 1652772837034
fluentbit_output_proc_records_total{name="stdout.0"} 149 1652772837034
fluentbit_output_proc_records_total{name="stdout.1"} 149 1652772837034
# HELP fluentbit_output_retried_records_total Number of retried records.
# TYPE fluentbit_output_retried_records_total counter
fluentbit_output_retried_records_total{name="http.2"} 0 1652772837034
fluentbit_output_retried_records_total{name="prometheus_exporter.3"} 0 1652772837034
fluentbit_output_retried_records_total{name="stdout.0"} 0 1652772837034
fluentbit_output_retried_records_total{name="stdout.1"} 0 1652772837034
# HELP fluentbit_output_retries_failed_total Number of abandoned batches because the maximum number of re-tries was reached.
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="http.2"} 0 1652772837034
fluentbit_output_retries_failed_total{name="prometheus_exporter.3"} 0 1652772837034
fluentbit_output_retries_failed_total{name="stdout.0"} 0 1652772837034
fluentbit_output_retries_failed_total{name="stdout.1"} 0 1652772837034
# HELP fluentbit_output_retries_total Number of output retries.
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="http.2"} 0 1652772837034
fluentbit_output_retries_total{name="prometheus_exporter.3"} 0 1652772837034
fluentbit_output_retries_total{name="stdout.0"} 0 1652772837034
fluentbit_output_retries_total{name="stdout.1"} 0 1652772837034
# HELP fluentbit_uptime Number of seconds that Fluent Bit has been running.
# TYPE fluentbit_uptime counter
fluentbit_uptime 151
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1652772686
# HELP fluentbit_build_info Build version information.
# TYPE fluentbit_build_info gauge
fluentbit_build_info{version="1.9.3",edition="Community"} 1
@patrick-stephens it works with the default configuration, but the metrics are exported at the endpoint /api/v1/metrics/prometheus instead of /metrics. Is there a way to make it use /metrics?
I don't think so, as those routes are part of the web server. The scrape config should handle it fine though; you just need to configure the path on the Prometheus side so it doesn't use the default.
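For example, a Prometheus scrape job can override the default /metrics path via metrics_path (the job name and target below are placeholders):

```yaml
scrape_configs:
  - job_name: fluent-bit
    metrics_path: /api/v1/metrics/prometheus
    static_configs:
      - targets: ["fluent-bit-host:2020"]
```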
thanks @patrick-stephens. I'll work around it with another solution. Closing the ticket for now. Thank you for the help.
I've re-opened this as it is a legitimate bug that will prevent use of the exporter. @leonardo-albertovich can you take a look?
The issue seems to be that related metrics are not grouped together by the exporter output, while they are by the web server.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
It seems that Prometheus and VictoriaMetrics can handle that situation well; however, there are providers like Dynatrace that scrape only the first entry of every metric and drop the rest.
As the new mechanism using fluentbit_metrics as an input seems to be the future-safe solution (storage metrics are available here in Prometheus format via the prometheus_exporter, and it is much more flexible), it would be great if the problem could be solved so that the new mechanism can be adopted more widely.
Ran into this bug today and confirmed that it is due to the fluentbit_metrics input plugin. Strangely, as @a-thaler mentioned, Prometheus itself has no issue parsing this malformed metrics text (despite the format violating its own spec). The Prometheus Go parser, however, fails with an error, as expected (see: https://github.com/prometheus/common/blob/main/expfmt/text_parse.go#L500)
It would be great if the fluentbit_metrics plugin could be used with the Prometheus exporter output plugin and produce properly formatted output. This would at the very least allow me to add additional labels to the metrics, which the monitoring API does not. That being said, the monitoring API metrics endpoint (mentioned here https://github.com/fluent/fluent-bit/issues/5465#issuecomment-1128519762) is a sufficient workaround for now, at least for my use case.
I'm trying to pull the Fluent Bit v2 metrics from Fluent Bit 3.1.3 into Telegraf, which uses the Go parser, and am stuck because of this issue. It seems to affect only v2, which I need so that I can plot my storage buffer usage.
I've encountered this issue as well, specifically with the inability to use storage metrics available exclusively in the API v2. Resolving this would significantly improve our monitoring capabilities. I hope this issue can be prioritized in the future.
👋🏽
Looks like more people want to use the new Prometheus endpoint, but can't, due to a broken exposition format implementation. Any updates on this, or at least pointers to what the challenges are?
(Not that it helps here, but I'm Prometheus maintainer here, open for feedback on our side how to make it easier for C codebases)
I looked into this today, here's what I found.
The problem
When all the metrics are collected from each plugin, the cmt_cat function is used to append an entire cmt context into the single one that will eventually get sent down the line. This is done because each plugin gets its own separate cmt context, since each plugin has the opportunity to register its own metrics. However, each input, filter, and output plugin also sets up a set of default metrics separately in its own context.
Let’s use fluentbit_input_records_total as an example. This metric is registered for every input plugin using the tag in the name label. The registration happens independently in each cmt context for every new input plugin. This counter’s map contains one metric, the counter for this name label for this input plugin. When this context is collected, each counter gets appended to the context.
Imagine there are 3 input plugins, and each one has its own metric context with a registered input_records_total. The problem is that Fluent Bit does not recognize that the full cmt context these metrics are added to already contains an input_records_total, so they are registered as 3 different metrics. Once this gets to the process for encoding individual metrics, a HELP and TYPE banner is produced for each one separately, because cmetrics does not consider them the same metric. In reality, what we would like is that the overall cmetrics payload contains one metric representing input_records_total, with 3 entries in its map, one for each of the 3 input plugins.
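The duplication can be modelled with a simplified sketch (in Python, not the actual cmetrics C code): if each plugin context is encoded as its own top-level metric, every entry gets its own HELP/TYPE banner.

```python
def encode(contexts):
    """Each context is a list of (metric_name, instance, value) entries.
    Emits a HELP/TYPE banner per entry, as the buggy concatenation path does."""
    out = []
    for ctx in contexts:
        for name, instance, value in ctx:
            out.append(f"# HELP {name} ...")
            out.append(f"# TYPE {name} counter")
            out.append(f'{name}{{name="{instance}"}} {value}')
    return out

# Three input plugins, each with its own context registering the same metric.
contexts = [
    [("fluentbit_input_records_total", "dummy.0", 121)],
    [("fluentbit_input_records_total", "dummy.1", 121)],
    [("fluentbit_input_records_total", "fluentbit_metrics.2", 60)],
]
lines = encode(contexts)
helps = [l for l in lines if l.startswith("# HELP")]
print(len(helps))  # 3 banners for one metric name -> invalid exposition text
```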
Solution
I began looking at this in an assistive capacity for another team; it isn't something that directly affects my work at this time. As such, it is unlikely I will be able to dedicate the time to develop and shepherd a fix myself. However, I've outlined what I think would be the two best possible ways to resolve this which someone else could take on.
Proposal 1: Shared metrics context for each plugin type
One path forward that I see is for all input plugins to share one metric context. This would be the same for filter and output plugins. In this case, the shared metrics context would be wrapped in a struct that also includes the addresses for each of the shared metrics, and when this shared context is passed into the initialization procedure of a new plugin instance, it simply records new values in the existing metrics.
I wrote a proof of concept for this just for input plugins: https://github.com/fluent/fluent-bit/pull/9231 Much of the code is a mess, but if you pull it down and build it, then use the following config:
[SERVICE]
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_PORT 2020
[INPUT]
Name cpu
[INPUT]
Name cpu
[INPUT]
Name cpu
[OUTPUT]
Name stdout
Match *
You will see the resulting metrics being correctly grouped as the Prometheus Exposition Format specifies.
braydonk@bk:~/Documents/test_flb$ curl localhost:2020/api/v2/metrics/prometheus
# HELP fluentbit_uptime Number of seconds that Fluent Bit has been running.
# TYPE fluentbit_uptime counter
fluentbit_uptime{hostname="bk.c.googlers.com"} 3
# HELP fluentbit_input_bytes_total Number of input bytes.
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="cpu.0"} 6628
fluentbit_input_bytes_total{name="cpu.1"} 4971
fluentbit_input_bytes_total{name="cpu.2"} 4971
# HELP fluentbit_input_records_total Number of input records.
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="cpu.0"} 4
fluentbit_input_records_total{name="cpu.1"} 3
fluentbit_input_records_total{name="cpu.2"} 3
If this approach would make sense, I would hand it off to @shuaich, my coworker from the team trying to tackle this problem. I think it is a straightforward enough implementation that I could provide guidance for it relatively easily.
The only open question is thread safety; I'm not sure whether the cmt context is designed to be thread safe when used with threaded input plugins. It seems to work fine for threaded output plugins, so I'm guessing it's okay, but I've never tried it with threaded input plugins.
Proposal 2: Adjust cmt_cat to account for metrics that already exist
I'm not sure this is a great path forward, but I'll include it here. The other way I see to accomplish this is for cmt_cat to account for metrics that already exist in the destination context: i.e. if I'm appending a context that contains fluentbit_input_records_total, cmt_cat would need to recognize that fluentbit_input_records_total already exists, and instead of copying the entire metric, add its values to the existing metric's cmt_map.
This would be a much harder implementation. I think this should only be considered if the thread safety of cmt isn't solid enough for Proposal 1.
If this were the direction chosen, I'd recommend a Fluent Bit maintainer take it on, as it involves nuances deep in the library code that are hard for a standard community contributor to work out.
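The merge behaviour Proposal 2 describes can be sketched in Python (a model of the idea, not the cmetrics C implementation): when appending a context, samples for a metric name that already exists in the destination are added to that metric's map instead of creating a duplicate top-level metric.

```python
def cat_merging(dst, src):
    """dst/src: dict of metric name -> dict of instance label -> sample value.
    Merges src into dst, combining samples under an existing metric name."""
    for name, samples in src.items():
        dst.setdefault(name, {}).update(samples)
    return dst

dst = {"fluentbit_input_records_total": {"dummy.0": 121}}
src = {"fluentbit_input_records_total": {"dummy.1": 121}}
cat_merging(dst, src)
# One metric entry with two samples -> a single HELP/TYPE banner when encoded.
print(len(dst))                                   # 1
print(len(dst["fluentbit_input_records_total"]))  # 2
```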
CC @edsiper @leonardo-albertovich to look over my proposals
I'm also affected by this issue. The presence of duplicate "TYPE" lines breaks Telegraf's parsing.
decoding response failed: text format parsing error in line 10: second HELP line for metric name "fluentbit_input_bytes_total"
I've pushed a draft PR to CMetrics to fix this: https://github.com/fluent/cmetrics/pull/222
For testing purposes, I created a test branch of Fluent Bit here:
- PR: https://github.com/fluent/fluent-bit/pull/9360
- branch: https://github.com/fluent/fluent-bit/tree/cmetrics-test-eduardo-cat-fixes
folks, would you mind giving the test branch a try? Any help is appreciated.
Hi @edsiper I was able to reproduce the issue with Telegraf and Fluent Bit 3.1.7.
podman run --rm -v $(pwd)/telegraf.config:/etc/telegraf/telegraf.conf:ro --entrypoint=telegraf telegraf
2024-09-06T21:21:48Z I! Loading config: /etc/telegraf/telegraf.conf
2024-09-06T21:21:48Z I! Starting Telegraf 1.31.3 brought to you by InfluxData the makers of InfluxDB
2024-09-06T21:21:48Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-09-06T21:21:48Z I! Loaded inputs: prometheus
2024-09-06T21:21:48Z I! Loaded aggregators:
2024-09-06T21:21:48Z I! Loaded processors:
2024-09-06T21:21:48Z I! Loaded secretstores:
2024-09-06T21:21:48Z I! Loaded outputs: exec
2024-09-06T21:21:48Z I! Tags enabled: host=8e2205bd85d8
2024-09-06T21:21:48Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"8e2205bd85d8", Flush Interval:10s
2024-09-06T21:21:48Z I! [inputs.prometheus] Using the label selector: and field selector:
2024-09-06T21:21:50Z E! [inputs.prometheus] Error in plugin: error reading metrics for "http://192.168.100.61:2020/api/v2/metrics/prometheus": decoding response failed: text format parsing error in line 10: second HELP line for metric name "fluentbit_input_bytes_total"
2024-09-06T21:22:00Z E! [inputs.prometheus] Error in plugin: error reading metrics for "http://192.168.100.61:2020/api/v2/metrics/prometheus": decoding response failed: text format parsing error in line 10: second HELP line for metric name "fluentbit_input_bytes_total"
^C2024-09-06T21:22:03Z I! [agent] Hang on, flushing any cached metrics before shutdown
2024-09-06T21:22:03Z I! [agent] Stopping running outputs
I've also used the branch from #9360 to validate the fix.
podman run --rm -v $(pwd)/telegraf.config:/etc/telegraf/telegraf.conf:ro --entrypoint=telegraf telegraf
2024-09-06T21:28:04Z I! Loading config: /etc/telegraf/telegraf.conf
2024-09-06T21:28:04Z I! Starting Telegraf 1.31.3 brought to you by InfluxData the makers of InfluxDB
2024-09-06T21:28:04Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-09-06T21:28:04Z I! Loaded inputs: prometheus
2024-09-06T21:28:04Z I! Loaded aggregators:
2024-09-06T21:28:04Z I! Loaded processors:
2024-09-06T21:28:04Z I! Loaded secretstores:
2024-09-06T21:28:04Z I! Loaded outputs: file
2024-09-06T21:28:04Z I! Tags enabled: host=519cbe732e08
2024-09-06T21:28:04Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"519cbe732e08", Flush Interval:10s
2024-09-06T21:28:04Z I! [inputs.prometheus] Using the label selector: and field selector:
fluentbit_uptime,host=519cbe732e08,hostname=chronolap.local,url=http://192.168.100.61:2020/api/v2/metrics/prometheus counter=152 1725658090000000000
fluentbit_output_proc_records_total,host=519cbe732e08,name=stdout.0,url=http://192.168.100.61:2020/api/v2/metrics/prometheus counter=1207 1725658090000000000
fluentbit_storage_fs_chunks_down,host=519cbe732e08,url=http://192.168.100.61:2020/api/v2/metrics/prometheus gauge=0 1725658090000000000
fluentbit_input_bytes_total,host=519cbe732e08,name=dummy.0,url=http://192.168.100.61:2020/api/v2/metrics/prometheus counter=27324 1725658090000000000
fluentbit_input_bytes_total,host=519cbe732e08,name=dummy.1,url=http://192.168.100.61:2020/api/v2/metrics/prometheus counter=16416 1725658090000000000
Given that @shuaich already tested the Prometheus Golang scraper, I'd say the fix works.
fixed with https://github.com/fluent/fluent-bit/pull/9392 (master) and https://github.com/fluent/fluent-bit/pull/9393 (3.1)