Stuck in a loop with ES output and log_to_metrics filter when an error occurs
Bug Report
Describe the bug
To count the errors in our log files, visualize the data, and maybe also alert on anomalies, I used the "log_to_metrics" filter. It worked fine at first glance, but on some nodes Fluent Bit went completely haywire: suddenly the Elasticsearch cluster had to deal with hundreds of thousands of additional log lines per second, and its storage filled up rapidly with these junk logs.
The line "could not append metrics" repeats indefinitely. I tried increasing the buffer size, but that didn't help.
To Reproduce
- Example log message if applicable:
[2024/10/04 09:08:11] [ warn] [http_client] cannot increase buffer: current=5000000 requested=5032768 max=5000000
[2024/10/04 09:08:11] [ info] [input] resume tail.0
[2024/10/04 09:08:11] [ info] [input] tail.0 resume (mem buf overlimit)
[2024/10/04 09:08:11] [ warn] [input] emitter.3 paused (mem buf overlimit)
[2024/10/04 09:08:11] [ info] [input] pausing emitter_for_log_to_metrics.1
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
...
- Steps to reproduce the problem:
- Add the ES output and the log_to_metrics filter (a stripped-down config sketch follows)
- Fill the buffer of the "http_client" (I guess)
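A minimal sketch of such a setup, reduced from the full configuration below (the log path and Elasticsearch host are from my environment; adjust as needed):

[SERVICE]
    Flush     5
    Log_Level info

[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    Tag           kube.*
    Mem_Buf_Limit 50MB

[FILTER]
    name               log_to_metrics
    match              kube.*
    tag                metrics
    metric_mode        counter
    metric_name        log_lines_total
    metric_description This metric counts all log lines
    regex              log .*

[OUTPUT]
    Name        es
    Match_Regex kube.*
    Host        elasticsearch.elk.svc.cluster.local
    Port        9200
    Buffer_Size 2M

(The prometheus_exporter output is omitted here since the log flood happens on the ES path.)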
Expected behavior
Fluent Bit should not write such a massive number of log lines.
Your Environment
- Version used: v3.1.7
- Configuration:
[SERVICE]
    Daemon       Off
    Flush        5
    Log_Level    info
    Parsers_File parsers.conf
    Parsers_File custom_parsers.conf
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    DB                /var/log/flb_kube.db
    Read_from_Head    True
    Parser            docker
    Tag               kube.*
    Buffer_Chunk_Size 5MB
    Buffer_Max_Size   5MB
    Mem_Buf_Limit     50MB
    Skip_Long_Lines   Off
    Refresh_Interval  10
    Docker_Mode       On

[INPUT]
    Name     kubernetes_events
    Tag      kubernetes_events
    Kube_URL https://kubernetes.default.svc:443

[INPUT]
    Name fluentbit_metrics
    Tag  metrics

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Kube_Tag_Prefix     kube.var.log.containers.
    Buffer_Size         128K
    Merge_Log           On
    Merge_Log_Key       log_processed
    K8S-Logging.Parser  On
    K8S-Logging.Exclude Off

[FILTER]
    name               log_to_metrics
    kubernetes_mode    On
    match              kube.*
    tag                metrics
    metric_mode        counter
    metric_name        log_lines_total
    metric_description This metric counts all log lines
    regex              log .*

[FILTER]
    name               log_to_metrics
    kubernetes_mode    On
    match              kube.*
    tag                metrics
    metric_mode        counter
    metric_name        log_errors_total
    metric_description This metric counts all errors in log lines
    regex              log .*(error|exception).*

[OUTPUT]
    Name            es
    Match_regex     kube.*
    Host            elasticsearch.elk.svc.cluster.local
    Port            9200
    Type            _doc
    Logstash_Format On
    Logstash_Prefix kubernetes_cluster
    Retry_Limit     False
    Buffer_Size     2M
    Trace_Error     On
    Replace_Dots    On

[OUTPUT]
    name  prometheus_exporter
    match metrics
    host  0.0.0.0
    port  2021
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.29.5 on AKS
- Server type and version: Linux
- Operating System and version: AKSUbuntu-2204gen2containerd-202409.23.0
Hi, same here.
I cannot verify this on my end. I think the new interval timer functionality could (at least) reduce the error log rate. This feature will be released with Fluent Bit 3.2.0; see https://github.com/fluent/fluent-bit/pull/9251.
If you like, you can try building the current master and test whether it fixes your issue.
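The usual source build is roughly the following (assuming the documented prerequisites such as CMake, flex, bison, and a C/C++ toolchain are installed; see the official build docs for the full list):

    git clone https://github.com/fluent/fluent-bit.git
    cd fluent-bit/build
    cmake ..
    make

The resulting binary ends up under bin/fluent-bit in the build directory.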
Updated to Fluent Bit 3.2.1 and reproduced the error.
Afterwards I added Flush_Interval_Sec 15 to the "log_to_metrics" filter config (see the snippet below). Not sure whether the value should be higher or lower, but I cannot reproduce the issue when this parameter is set.
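For reference, one of my filter sections with the new parameter set; 15 is simply the value I tested, not a tuned recommendation:

[FILTER]
    name               log_to_metrics
    kubernetes_mode    On
    match              kube.*
    tag                metrics
    metric_mode        counter
    metric_name        log_errors_total
    metric_description This metric counts all errors in log lines
    regex              log .*(error|exception).*
    # New in 3.2: emit metrics on a timer instead of per processed record
    Flush_Interval_Sec 15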
Thanks a lot for the fix.