Stuck in a loop with ES output and log_to_metrics filter when an error occurs
Bug Report
Describe the bug
To count the errors in our log files, visualize the data, and maybe also alert on anomalies, I used the "log_to_metrics" filter. It worked fine at first glance, but on some nodes Fluent Bit went completely haywire: suddenly the Elasticsearch cluster had to deal with hundreds of thousands of additional log lines per second, and its storage filled up rapidly with these junk logs.
The line "could not append metrics" repeats indefinitely. I tried increasing the buffer size, but that didn't help.
To Reproduce
- Example log message if applicable:
[2024/10/04 09:08:11] [ warn] [http_client] cannot increase buffer: current=5000000 requested=5032768 max=5000000
[2024/10/04 09:08:11] [ info] [input] resume tail.0
[2024/10/04 09:08:11] [ info] [input] tail.0 resume (mem buf overlimit)
[2024/10/04 09:08:11] [ warn] [input] emitter.3 paused (mem buf overlimit)
[2024/10/04 09:08:11] [ info] [input] pausing emitter_for_log_to_metrics.1
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
[2024/10/04 09:08:11] [error] [filter:log_to_metrics:log_to_metrics.1] could not append metrics
...
- Steps to reproduce the problem:
- Add the ES output and the log_to_metrics filter (a stripped-down config sketch follows)
- Fill the buffer of the "http_client" (I guess)
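A minimal sketch of such a setup, reduced from the full configuration below (the log path and Elasticsearch host are from my environment; adjust as needed):

[SERVICE]
    Flush     5
    Log_Level info

[INPUT]
    Name          tail
    Path          /var/log/containers/*.log
    Tag           kube.*
    Mem_Buf_Limit 50MB

[FILTER]
    name               log_to_metrics
    match              kube.*
    tag                metrics
    metric_mode        counter
    metric_name        log_lines_total
    metric_description This metric counts all log lines
    regex              log .*

[OUTPUT]
    Name        es
    Match_Regex kube.*
    Host        elasticsearch.elk.svc.cluster.local
    Port        9200
    Buffer_Size 2M

(The prometheus_exporter output is omitted here since the log flood happens on the ES path.)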
Expected behavior
Fluent Bit should not write such a massive number of log lines.
Your Environment
- Version used: v3.1.7
- Configuration:
[SERVICE]
    Daemon       Off
    Flush        5
    Log_Level    info
    Parsers_File parsers.conf
    Parsers_File custom_parsers.conf
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    DB                /var/log/flb_kube.db
    Read_from_Head    True
    Parser            docker
    Tag               kube.*
    Buffer_Chunk_Size 5MB
    Buffer_Max_Size   5MB
    Mem_Buf_Limit     50MB
    Skip_Long_Lines   Off
    Refresh_Interval  10
    Docker_Mode       On

[INPUT]
    Name     kubernetes_events
    Tag      kubernetes_events
    Kube_URL https://kubernetes.default.svc:443

[INPUT]
    Name fluentbit_metrics
    Tag  metrics

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Kube_Tag_Prefix     kube.var.log.containers.
    Buffer_Size         128K
    Merge_Log           On
    Merge_Log_Key       log_processed
    K8S-Logging.Parser  On
    K8S-Logging.Exclude Off

[FILTER]
    name               log_to_metrics
    kubernetes_mode    On
    match              kube.*
    tag                metrics
    metric_mode        counter
    metric_name        log_lines_total
    metric_description This metric counts all log lines
    regex              log .*

[FILTER]
    name               log_to_metrics
    kubernetes_mode    On
    match              kube.*
    tag                metrics
    metric_mode        counter
    metric_name        log_errors_total
    metric_description This metric counts all errors in log lines
    regex              log .*(error|exception).*

[OUTPUT]
    Name            es
    Match_regex     kube.*
    Host            elasticsearch.elk.svc.cluster.local
    Port            9200
    Type            _doc
    Logstash_Format On
    Logstash_Prefix kubernetes_cluster
    Retry_Limit     False
    Buffer_Size     2M
    Trace_Error     On
    Replace_Dots    On

[OUTPUT]
    name  prometheus_exporter
    match metrics
    host  0.0.0.0
    port  2021
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.29.5 on AKS
- Server type and version: Linux
- Operating System and version: AKSUbuntu-2204gen2containerd-202409.23.0
Hi, same here.
I cannot verify this on my end. I think the new interval timer functionality could (at least) reduce the error log rate. This feature will be released with Fluent Bit 3.2.0; see https://github.com/fluent/fluent-bit/pull/9251.
If you like, you can try building the current master and test whether it fixes your issue.
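The usual source build is roughly the following (assuming the documented prerequisites such as CMake, flex, bison, and a C/C++ toolchain are installed; see the official build docs for the full list):

    git clone https://github.com/fluent/fluent-bit.git
    cd fluent-bit/build
    cmake ..
    make

The resulting binary ends up under bin/fluent-bit in the build directory.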
Updated to Fluent Bit 3.2.1 and reproduced the error.
Afterwards I added Flush_Interval_Sec 15 to the "log_to_metrics" filter config (see the snippet below). Not sure whether the value should be higher or lower, but I cannot reproduce the issue when this parameter is set.
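For reference, one of my filter sections with the new parameter set; 15 is simply the value I tested, not a tuned recommendation:

[FILTER]
    name               log_to_metrics
    kubernetes_mode    On
    match              kube.*
    tag                metrics
    metric_mode        counter
    metric_name        log_errors_total
    metric_description This metric counts all errors in log lines
    regex              log .*(error|exception).*
    # New in 3.2: emit metrics on a timer instead of per processed record
    Flush_Interval_Sec 15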
Thanks a lot for the fix.