FluentBit spams itself with "error registering chunk with tag"

Open panaji opened this issue 1 year ago • 47 comments

Bug Report

I see these errors in the aggregator a few seconds after it starts. Usually, I see this error after reaching Emitter_Mem_Buf_Limit.

Our forwarder also tails the Fluent Bit logs, which exacerbates the problem: these errors are tailed and re-forwarded to the aggregator, which then generates the same error again, and the cycle continues. I don't know how to repro this, but is there any feature to suppress error messages, for example, to not generate this error more than x times per minute?
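For what it's worth, the rewrite_tag filter documents its own emitter buffering options (Emitter_Mem_Buf_Limit, Emitter_Storage.type) that could be tuned so the re-emit buffer fills less easily. A minimal sketch, where the rule and limit values are placeholders rather than our real config:

[FILTER]
    Name                   rewrite_tag
    Match                  container.*
    # placeholder rule: re-tag container records under mdsd.container.log
    Rule                   $log .* mdsd.container.log false
    Emitter_Name           re_emitted.container_log
    # raise the emitter buffer, or spill it to disk (filesystem storage requires storage.path in [SERVICE])
    Emitter_Mem_Buf_Limit  50M
    Emitter_Storage.type   filesystem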

To Reproduce

2022-07-20 18:43:21.7122020 | [2022/07/20 18:43:21] [ info] [fluent bit] version=1.9.4, commit=08de43e474, pid=1
2022-07-20 18:43:21.7122250 | [2022/07/20 18:43:21] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
2022-07-20 18:43:21.7122330 | [2022/07/20 18:43:21] [ info] [cmetrics] version=0.3.1
2022-07-20 18:43:21.7123510 | [2022/07/20 18:43:21] [ info] [input:forward:input.forward] listening on 0.0.0.0:24224
2022-07-20 18:43:21.7141520 | [2022/07/20 18:43:21] [ info] [output:forward:forward.mdsd] worker #0 started
2022-07-20 18:43:21.7357780 | [2022/07/20 18:43:21] [ info] [output:forward:forward.mdsd] worker #1 started
2022-07-20 18:43:21.7358770 | [2022/07/20 18:43:21] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
2022-07-20 18:43:21.7358880 | [2022/07/20 18:43:21] [ info] [sp] stream processor started
2022-07-20 18:43:56.7986370 | [2022/07/20 18:43:56] [error] [input:emitter:re_emitted.container_log] error registering chunk with tag: mdsd.container.log
2022-07-20 18:43:56.7986420 | [2022/07/20 18:43:56] [error] [input:emitter:re_emitted.container_log] error registering chunk with tag: mdsd.container.log
2022-07-20 18:43:56.7986500 | [2022/07/20 18:43:56] [error] [input:emitter:re_emitted.container_log] error registering chunk with tag: mdsd.container.log
2022-07-20 18:43:56.7986540 | [2022/07/20 18:43:56] [error] [input:emitter:re_emitted.container_log] error registering chunk with tag: mdsd.container.log
...
  • Steps to reproduce the problem:
  • N/A

Expected behavior

FluentBit should not generate the same error messages more than n times per sec/min.

Screenshots

Your Environment

  • Version used: 1.9.4
  • Configuration:
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.22.11
  • Server type and version:
  • Operating System and version:
  • Filters and plugins: tail, kubernetes, rewrite_tag, lua, forward

Additional context

panaji avatar Jul 21 '22 22:07 panaji

any news?

pytomtoto avatar Aug 09 '22 12:08 pytomtoto

Hi @panaji, could you try the latest version of Fluent Bit, 1.9.7?

LatorreDev avatar Aug 19 '22 16:08 LatorreDev

@LatorreDev, do you have the commit/feature that may fix this issue? I think the error itself is legitimate; the concern is the spam, which grows exponentially if we also tail Fluent Bit itself.

panaji avatar Aug 22 '22 20:08 panaji

@LatorreDev yes this behavior persists into version 1.9.7

jessejlt avatar Sep 13 '22 19:09 jessejlt

I believe I'm experiencing the same problem in 1.9.8:

$ head -n20 dump
Fluent Bit v1.9.8

  • Copyright (C) 2015-2022 The Fluent Bit Authors
  • Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
  • https://fluentbit.io

[2022/11/03 04:06:55] [ info] [fluent bit] version=1.9.8, commit=, pid=418
[2022/11/03 04:06:55] [ info] [storage] version=1.2.0, type=memory+filesystem, sync=normal, checksum=enabled, max_chunks_up=128
[2022/11/03 04:06:55] [ info] [storage] backlog input plugin: storage_backlog.6
[2022/11/03 04:06:55] [ info] [cmetrics] version=0.3.6
[2022/11/03 04:06:55] [ warn] [input:thermal:thermal.3] thermal device file not found
[2022/11/03 04:06:55] [ info] [input:systemd:systemd.5] seek_cursor=s=eed386c868434a799bd4c3b033621892;i=51d... OK
[2022/11/03 04:06:55] [ info] [input:storage_backlog:storage_backlog.6] queue memory limit: 4.8M
[2022/11/03 04:06:55] [ info] [output:file:file.0] worker #0 started
[2022/11/03 04:06:55] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2022/11/03 04:06:55] [ info] [sp] stream processor started
[2022/11/03 04:06:55] [ warn] [input] emitter.7 paused (mem buf overlimit)
[2022/11/03 04:06:55] [error] [input:emitter:at_escaper] error registering chunk with tag: systemd.user_at_375336.service
[2022/11/03 04:06:55] [error] [input:emitter:at_escaper] error registering chunk with tag: systemd.user_at_375336.service

I'm launching the FLB process inside a container of its own. However, I'm not seeing much usage in the on-disk buffer:

$ ls -l /var/lib/fluent-bit-storage/journal_systemd_db
-rw-r--r-- 1 nobody nogroup 8192 Nov 2 23:57 /var/lib/fluent-bit-storage/journal_systemd_db

joshgc avatar Nov 03 '22 04:11 joshgc

After debugging last night, I have not solved the issue, but I am quite confident it's from a misbehaving rewrite_tag filter. It appears to either never match, or match all records, in which case it produces an infinite loop. There is definitely a bug in (my understanding of rewrite_tag + its implementation).

joshgc avatar Nov 03 '22 16:11 joshgc

FWIW I'm not even sure this error message (which continues to haunt me) is even an error. I get it spammed at me for 13 seconds at startup, then it goes silent and the system "recovers". I think FLB is maybe overallocating the number of workers and running out of chunk space? I don't really know, but now it's very hard for me to tell whether I'm losing data or not. Also, systemd isn't even producing much data because I have tail_from_logs on. I would have thought that's a slow rate, but the problem does not exist for just the DUMMY input.

joshgc avatar Nov 03 '22 22:11 joshgc

This error took down a service today. I was running fluent-bit 2.0.4, the logs were full of "error registering chunk with tag" errors, and fluent-bit consumed nearly 100% CPU across all cores. We had to stop/restart fluent-bit as a temporary fix. I counted fluent-bit spamming its own logs about 50x per second.

I suspect it's entering an infinite loop: it has trouble sending all the logs to AWS CloudWatch, which causes it to back up and generate more error logging, which adds to the queue of logs to AWS CloudWatch, which increases the size of the backlog, causing more self-spamming of the logs, completing a loop that becomes a death spiral.

Regardless of the other aspects of this, sometimes Fluent Bit spams its own logs with the exact same log entry more than 10x in the same second, and that's a problem on its own. Other logging systems will just switch to a summary, like "last log entry repeated 20 times".

Also, the log entry "error registering chunk with tag" doesn't seem like the real root cause, because the filters work fine before fluent-bit enters a death spiral.

markstos avatar Nov 21 '22 20:11 markstos

A couple of thoughts here:

  1. A feature or config option to reduce verbosity / introduce controls for repeated messages
  2. Is there an isolated repro that we can use, @markstos @joshgc @panaji?

agup006 avatar Nov 21 '22 21:11 agup006

@agup006, I couldn't repro this, but a few people seem to have the same problem. If we had a feature where Fluent Bit does not generate the same error message more than n times per sec/min, we could probably investigate further. Right now, the logs are just filled with this message.

panaji avatar Nov 21 '22 21:11 panaji

It took several days of normal functioning for the error to appear for me, so I'm not confident I can offer an isolated repro, as I'm not certain of the trigger. I'm only certain that log-spamming the same error repeatedly wasn't helping and might have been part of triggering a logging death spiral. One workaround I could try is to stop forwarding the fluent-bit process's own logs to AWS CloudWatch, which would also prevent fluent-bit's own log volume from contributing to overloading the connection to a remote server. Now that I've experienced this, that seems like a safer design, since it avoids creating a bad loop whenever a problem with an overloaded remote service causes fluent-bit to generate even more logging destined for that same service.

markstos avatar Nov 21 '22 22:11 markstos

I dug through my logs some more to see if I could find what Fluent Bit was logging just before the death-spiral log spam sequence happened, in case the preceding logs were interesting. Here are the log entries I found before the repetitive log spam:

[ warn] [input] emitter.1 paused (mem buf overlimit)
[error] [tls] error: error:00000005:lib(0):func(0):DH lib"}
[error] [build/src/CMakeFiles/fluent-bit-static.dir/compiler_depend.ts:1199 errno=25] Inappropriate ioctl for device"}

(the order there isn't necessarily chronological)

Around the same time, an ill-configured test app was dumping a massive amount of debugging data to the logs, and Fluent Bit was trying to process that.

Does that help?

markstos avatar Nov 22 '22 20:11 markstos

@vwbusguy recognized the death-spiral description as being caused by Fluent Bit both writing to the systemd journal and reading from it... which happens to be my configuration. In practice, that configuration works most of the time, but once Fluent Bit starts spamming its own logs, at some point a threshold is crossed and Fluent Bit can't keep up with processing as input the logs it's generating as output...

So I'll update my configuration to break that loop as a workaround, while this issue can stay focused on preventing Fluent Bit from generating the same log entries repeatedly.
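One way that loop could be broken (a sketch only, assuming the journal records keep their _SYSTEMD_UNIT field, the systemd input is tagged systemd.*, and Fluent Bit runs as a unit named something like fluent-bit.service) is a grep filter that drops Fluent Bit's own journal entries before they reach any output:

[FILTER]
    Name     grep
    Match    systemd.*
    # drop Fluent Bit's own journal entries so its log spam is never re-ingested
    Exclude  _SYSTEMD_UNIT fluent-bit\.service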

markstos avatar Nov 22 '22 22:11 markstos

It would be good to have a metric exposed when memory limits like Emitter_Mem_Buf_Limit are reached, so people can alert or act on that.
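Until such a metric exists, one partial substitute is the built-in HTTP server, which exposes per-plugin counters (and, with storage.metrics enabled, chunk counts) in Prometheus format. A minimal sketch of enabling it follows; it does not report the exact overlimit event, but it gives something to scrape and alert on:

[SERVICE]
    HTTP_Server      On
    HTTP_Listen      0.0.0.0
    HTTP_Port        2020
    # adds storage/chunk metrics to the built-in endpoints
    storage.metrics  On
# Prometheus-format metrics are then available at /api/v1/metrics/prometheus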

AlessioCasco avatar Nov 24 '22 17:11 AlessioCasco

I've been hitting this issue as well in k8s, where the fluent-bit pod frequently runs out of memory and crashes.

On a 30-node cluster, with fluent-bit deployed as a daemonset, there was only one fluent-bit pod repeatedly crashing. That pod was on the same node as a very log-spammy pod. To confirm it was the spammy pod, I excluded it and the fluent-bit pod from the path in the tail config. That stopped the crashing, but that's hardly a fix, as I want those logs as well.
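(For reference, that exclusion can be done with the tail input's Exclude_Path option; a rough sketch follows, where the glob patterns are illustrative guesses at the pod log file names rather than the exact paths.)

[INPUT]
    Name          tail
    Tag           kubernetes.*
    Path          /var/log/containers/*.log
    # illustrative globs: skip the spammy pod and fluent-bit's own container logs
    Exclude_Path  /var/log/containers/log-spammer*.log,/var/log/containers/fluent-bit*.log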

I managed to replicate the errors in a testing environment, where I had to:

  1. Turn off the fluent-bit daemonset
  2. Deploy log-spamming pods to build up several large log files
  3. Wait 10 minutes for the log files to grow large
  4. Turn on the fluent-bit daemonset again

Then I got the error logs right away, in the tens of thousands:

[2022/11/24 16:45:29] [error] [input:emitter:emitter_for_rewrite_tag.1] error registering chunk with tag: kubernetes.log-spammer
[2022/11/24 16:45:29] [error] [input:emitter:emitter_for_rewrite_tag.1] error registering chunk with tag: kubernetes.log-spammer
[2022/11/24 16:45:29] [error] [input:emitter:emitter_for_rewrite_tag.1] error registering chunk with tag: kubernetes.fluent-bit-forwarder
[2022/11/24 16:45:29] [error] [input:emitter:emitter_for_rewrite_tag.1] error registering chunk with tag: kubernetes.fluent-bit-forwarder

Although I didn't manage to crash it in the testing environment, I was able to reproduce the error message.

The problem I had on the large k8s cluster seems to have fixed itself once the components got redeployed and there was no longer a massive backlog of logs for the fluent-bit pods to process.

Hope this can help anyone else.

stebbib avatar Nov 24 '22 18:11 stebbib

@agup006 Sorry for the ping, but I'm not sure the team noticed that there is a potential way to reproduce the issue above.

I think tailing from a large enough file might reproduce the issue or at least the error messages, in case devs don't have a k8s cluster to test it on.

Let me know if there's any more info I can provide.
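A rough, untested sketch of such a repro, kept outside k8s: a tail input reading a large file from the beginning, a rewrite_tag filter with a deliberately tiny emitter buffer so it overflows quickly, and a null output. Every path, tag, and value below is illustrative:

[INPUT]
    Name                   tail
    Tag                    big.file
    Path                   /tmp/large-test.log
    # read the whole backlog instead of only new lines
    Read_from_Head         On

[FILTER]
    Name                   rewrite_tag
    Match                  big.file
    # re-tag every record so it passes through the emitter
    Rule                   $log .* retagged.file false
    # tiny buffer on purpose, to trigger "error registering chunk with tag" quickly
    Emitter_Mem_Buf_Limit  1M

[OUTPUT]
    Name   null
    Match  *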

stebbib avatar Dec 20 '22 14:12 stebbib

We ran into the same issue with systemd multiline parsing using 1.9.10:

[2023/01/17 15:07:45] [error] [input:emitter:emitter_for_systemd_linkerd_multiline] error registering chunk with tag: development.linkerd

This was perhaps caused by a large batch of unsent logs due to an output misconfiguration. It is extremely concerning that Fluent Bit fails in this manner.

jstaffans-relex avatar Jan 17 '23 13:01 jstaffans-relex

Hey folks, an update here: we have merged a "Log Suppression" feature (https://github.com/fluent/fluent-bit/pull/6435), which should be released soon. This should help with errors that continue to show up.
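Judging from the config snippets shared later in this thread, the new setting appears to be applied per output section; a minimal sketch (the interval value is arbitrary):

[OUTPUT]
    Name                    es
    Match                   *
    # suppress repeats of the same internal log message from this output within the interval
    log_suppress_interval   10s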

agup006 avatar Jan 18 '23 16:01 agup006

I built from master and deployed to a k8s cluster, and it did not seem to have any effect on the "error registering chunk with tag" errors.

stebbib avatar Jan 19 '23 17:01 stebbib

@stebbib the PR referenced is not for that specific log error; it is a more general fix to stop any spamming error message once configured. Were you able to add the configuration referenced in the PR?

agup006 avatar Jan 20 '23 00:01 agup006

Yes, added it to the output config I had and that seemed to work, at least without complaint from Fluent-Bit 😄

stebbib avatar Jan 20 '23 10:01 stebbib

We saw this when the es output host was taken down intentionally to test the disk-based buffering.

The storage.total_limit_size is 1G, but once this spamming is triggered, it just keeps growing the disk usage beyond the limit. We are very concerned about the potential for this type of disruption in our production clusters. In the meantime, we will test the new Log_Suppress_Interval setting as a workaround.

# fluent-bit version
Fluent Bit v1.9.10

# du -h /var/log/
...
46G     /var/log/flb-storage/tail.0
46G     /var/log/flb-storage
46G     /var/log/

# cat fluent-bit/etc/fluent-bit.conf
[SERVICE]
...
    flush                      1
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.checksum          off
    storage.max_chunks_up     128
    storage.backlog.mem_limit 12M

    Parsers_File /fluent-bit/parsers/parsers.conf
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    DB                /var/log/flb_kube.db
    Parser            docker
    Docker_Mode       On
    Mem_Buf_Limit     12MB
    Skip_Long_Lines   On
    Refresh_Interval  10
    # https://docs.fluentbit.io/manual/pipeline/inputs/tail
    storage.type       filesystem
    Rotate_Wait        5
    Ignore_Older       1h

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc.cluster.local:443
    Merge_Log           On
    Merge_Log_Key       data
    Keep_Log            On
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
    Buffer_Size         32k
[FILTER]
    Name          modify
    Match         *
    # add the cluster name to all records
    Add           cluster_name some-cluster
[FILTER]
    Name          rewrite_tag
    Match         kube.*
    # prepend with the log type when value exists
    Rule          $kubernetes['labels']['logType']    ^([a-zA-Z0-9]+)$    $0.$TAG     false
    # prepend with "k8s" as the fallback rule so the tag always have the same number of parts
    Rule          $kubernetes['namespace_name']       .*        k8s.$TAG    false

[OUTPUT]
    Name            es
    Match           *
    AWS_Region      us-west-2
    AWS_Auth        On
    Host            some-host
    Port            9200
    Retry_Limit     6
    Replace_Dots    On
    Trace_Error               true
    Logstash_Format           on
    # takes precedence when value present
    Logstash_Prefix_Key        $kubernetes['labels']['logType']
    # fallback when "Logstash_Prefix_Key" value not present
    Logstash_Prefix            k8s
    storage.total_limit_size  1G
    Suppress_Type_Name        On
    Retry_Limit               2

[OUTPUT]
    Name                   s3
    Match                  *
    bucket                 some-bucket
    region                 us-west-2
    upload_timeout         10m
    retry_limit            2
    compression            gzip
    store_dir_limit_size   1G

gz243 avatar Mar 12 '23 18:03 gz243

Hi all, I can confirm what @stebbib stated: it seems like it doesn't have any effect on the "error registering chunk with tag" log.

terencebor avatar Mar 13 '23 16:03 terencebor

I’m going to be evaluating Vector instead. This is a critical issue.

https://vector.dev/

markstos avatar Mar 15 '23 13:03 markstos

@markstos @terencebor @gz243 did you try the log suppression feature mentioned above?

agup006 avatar Mar 15 '23 13:03 agup006

@agup006 correct me if I'm wrong, the log suppression feature works only for output plugins, unlike fluentd where the rewrite-tag is an output plugin, the rewrite-tag filter in fluent-bit is a filter plugin and as such can't use the log suppression feature.

stebbib avatar Mar 15 '23 13:03 stebbib

The 2.x upgrade also deleted the configuration file: https://github.com/fluent/fluent-bit/issues/6377

Finding two critical issues during my evaluation was all I needed to lose faith and move on. I need my logging solution to be rock solid, not delete its configuration file and crash the system.

markstos avatar Mar 15 '23 13:03 markstos

@markstos Thanks for assisting in that issue, and understand we all have to keep our services up and running :). All the best and appreciate your contributions

agup006 avatar Mar 15 '23 13:03 agup006

@agup006 correct me if I'm wrong, the log suppression feature works only for output plugins, unlike fluentd where the rewrite-tag is an output plugin, the rewrite-tag filter in fluent-bit is a filter plugin and as such can't use the log suppression feature.

This should apply specifically to the log files that Fluent Bit generates itself. Adding @lecaros @RicardoAAD, who might have some sample config we can leverage to set it up.

agup006 avatar Mar 15 '23 13:03 agup006

@agup006 , as others have reported (1, 2), I continue to see the error spammed after adding the log_suppress_interval to all my output plugins:

[OUTPUT]                                                                                                                                                                                                                                                                             
    Name                es
    log_suppress_interval   10s                                                                                                                                                                                                                                            
    ...
[2023/03/15 16:09:58] [ info] [fluent bit] version=2.0.9, commit=4c0ca4fc5f, pid=1
[2023/03/15 16:20:34] [error] [input:emitter:emitter_for_rewrite_tag.2] error registering chunk with tag: k8s.kube.var.log.containers.fluent-bit-6jdm9_logging_fluent-bit-ce93fba6f71656419554b66865b71e38a194e3a70657e2c357a2a84b294b8de3.log
[2023/03/15 16:20:34] [error] [input:emitter:emitter_for_rewrite_tag.2] error registering chunk with tag: k8s.kube.var.log.containers.fluent-bit-6jdm9_logging_fluent-bit-ce93fba6f71656419554b66865b71e38a194e3a70657e2c357a2a84b294b8de3.log
[2023/03/15 16:20:34] [error] [input:emitter:emitter_for_rewrite_tag.2] error registering chunk with tag: k8s.kube.var.log.containers.fluent-bit-6jdm9_logging_fluent-bit-ce93fba6f71656419554b66865b71e38a194e3a70657e2c357a2a84b294b8de3.log

Additional details:

  1. The error spam seems reproducible for me now when using a simple test app that logs 1000 chars every millisecond.
  2. After removing the rewrite_tag filter the error goes away.
  3. When the spam is triggered, storage.total_limit_size is not respected.

I concur with @stebbib's comment that the log suppression, as it stands now, does not seem to apply to filters (only outputs). This is further corroborated by this open issue as well as this comment.

It remains unclear to me why this error is triggered in the first place. The log suppression is a handy workaround when it works. But another serious concern for me is why storage.total_limit_size is not respected, irrespective of this error spam.

gz243 avatar Mar 15 '23 15:03 gz243