Fluent Bit consumes 100% CPU and logs "error registering chunk" errors with the multiline parser in versions 2.1.x (regression from v1.9.7)
I have a Fluent Bit / Fluentd configuration that has been working well on version 1.9.7 (tail input, no memory limits). After upgrading from 1.9.7 to 2.1.2, we noticed that at high log rates (around 1000 log lines per second), Fluent Bit reaches 100% CPU after several minutes and starts logging many errors. Once this happens it does not stop; it keeps logging many of these errors every second (it is easily reproducible locally): "[input:emitter:emitter_for_multiline.4] error registering chunk". I also tried 2.1.8 and hit the same problem. After returning to the original version 1.9.7, CPU usage was around 5% and there were no errors. I tested this several times locally and the result is consistent.
At first I got errors related to the rewrite_tag filter I was using, so I thought the rewrite_tag filter might be the problem and removed it. Now I get errors related to multiline instead, so I assume it is a more fundamental issue that is not tied to a specific filter. Once Fluent Bit starts writing these errors there is no stopping it; it constantly logs many such error lines.
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
My guess is that this issue was introduced somewhere in the 2.1.x line.
To reproduce, I ran Fluent Bit / Fluentd locally with multiline parser filters and several different mock components generating logs at a high rate.
I have several multiline parsers for different components, but they all look more or less like the one below. I assume, though, that any parser will do.
[INPUT]
    Name              tail
    Path              /var/log/containers/*srl-*.log
    Key               log
    Refresh_Interval  10
    Tag               kubelrc.<container_name>.<namespace_name>.<pod_name>
    Tag_Regex         (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Parser            cri
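For context, the tag is built by applying Tag_Regex to the monitored file name and substituting the captured groups into Tag. As a hypothetical example (the file name below is made up so that it matches the tags seen in the error log above), a file such as

    /var/log/containers/srl-main-server-deployment-123-lcgsm_qa3_srl-main-server-abc123.log

yields pod_name=srl-main-server-deployment-123-lcgsm, namespace_name=qa3 and container_name=srl-main-server, so its records are tagged

    kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm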
[FILTER]
    name                  multiline
    match                 kubelrc.srl-tdigest*.*
    # Must set the field that contains the message to parse
    multiline.key_content log
    multiline.parser      multiline-tdigest
[MULTILINE_PARSER]
    name          multiline-tdigest
    type          regex
    # TDigest state machine
    # Lines starting with dddd-dd-dd are assumed to carry a timestamp.
    # Any lines after that which do NOT start with dddd-dd-dd are considered part of the previous line.
    # NOTE - this can cause a problem: for example, on a restart of tdigest the first logs that have no
    # timestamp will be appended to the last timestamped log from before the restart.
    # We could consider being stricter with the cont rule and specifying what to expect after a new line.
    # rules | state name    | regex pattern                                     | next state name
    # ------|---------------|---------------------------------------------------|----------------
    rule      "start_state"   "/^(([0-9]{4}-[0-9]{2}-[0-9]{2})|IMAGE_TAG)/"        "cont"
    rule      "cont"          "/^(?!(([0-9]{4}-[0-9]{2}-[0-9]{2})|IMAGE_TAG))/"    "cont"
    #rule     "cont"          "/^(\s+at).*/"                                       # Java exceptions only
Hi, can someone please look into this? It is definitely an issue that requires attention, as it is a regression introduced in the 2.1.x line (at least I know for sure it is present in 2.1.2 and continues in 2.1.8). Last week we returned to version 1.9.7 and everything is working fine, so the problem we had was definitely caused by the upgrade.
New update: it seems the issue started after 2.0.9, because 2.0.9 does not have this problem and we know that 2.1.2 does. I know I am not the only one with this issue, as reported by others here: https://github.com/fluent/fluent-bit/issues/4940
That bit about setting buffer to off alleviating the issue, mentioned by terrencebor, sounds quite interesting. Do you think you could give it a try? Not as a workaround suggestion, but to confirm that it's a lead; that would make it much easier for the team to address, since we are quite limited time-wise.
Thank you for taking the time to report the issue and for your patience.
Hi Leonardo, I actually already checked that before, but just to make sure I checked it again right now: adding "buffer off" to the multiline parser filters does not solve the 100% CPU issue. The problem remains.
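For reference, this is roughly how it was applied, assuming the multiline filter's buffer option is what was meant (a sketch adapted from the filter config above):

    [FILTER]
        name                  multiline
        match                 kubelrc.srl-tdigest*.*
        multiline.key_content log
        multiline.parser      multiline-tdigest
        buffer                off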
Same here. I now have to revert to the old version as well; in my case I tried to upgrade from 2.0.6 to 2.1.9.
Same.
I have the same issue in 2.1.10 and have created a reproducible test case here: https://github.com/rhysm/flb-multiline-repro
I have also attached callgrind output from running fluent-bit for ~1 minute. The attachment contains callgrind output for the last working version, 2.0.14, and the latest, 2.1.10, which still appears to have this issue.
Using the multiline parser as part of the tail input itself, rather than as a separate filter, solved this issue for us: https://docs.fluentbit.io/manual/pipeline/inputs/tail#multiline-support
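In case it helps others, a minimal sketch of that setup, adapted from the tail input in the original report (paths, tag regex and parser name are the reporter's; chaining the built-in cri parser with the custom one via a comma-separated multiline.parser list is an assumption based on the tail input docs, and the separate multiline filter is dropped):

    [INPUT]
        Name              tail
        Path              /var/log/containers/*srl-*.log
        Refresh_Interval  10
        Tag               kubelrc.<container_name>.<namespace_name>.<pod_name>
        Tag_Regex         (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
        # multiline.parser handles CRI parsing plus the custom concatenation, replacing 'Parser cri' and the filter
        multiline.parser  cri, multiline-tdigest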
same issue here :(
It seems that 2.2.2 still has this 100% CPU issue (#4940). Hi @leonardo-albertovich, any chance someone can have a look at this regression? It seems pretty problematic, don't you think? It appears to have started after 2.0.9. Thanks.
2.0.9 works correctly. The issue started in the next version after 2.0.9
2.2.0 has the same issue.
Same for me on 2.2.0.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.
This issue was closed because it has been stalled for 5 days with no activity.