
Fluent Bit consumes 100% CPU and logs "error registering chunk" errors with the multiline parser in versions 2.1.x (regression from v1.9.7)

ryan65 opened this issue 2 years ago · 14 comments

I have a fluentbit/fluentd configuration that has been working pretty well (version 1.9.7): input using tail with no memory limits. After upgrading from 1.9.7 to 2.1.2, we noticed that at high log rates (around 1000 log lines per second) fluentbit reaches 100% CPU after several minutes and starts to log many errors. Once this happens it doesn't stop, logging many errors each second (it's easily reproducible locally): "[input:emitter:emitter_for_multiline.4] error registering chunk". I also tried 2.1.8 and hit the same problem. After returning to the original version 1.9.7, CPU was around 5% and there were no errors. I tested this several times locally and it is consistent.

At first I got errors related to the rewrite_tag filter I used, so I thought the rewrite might have an issue and removed that filter, but now I get errors related to multiline, so I assume it's a much more basic issue that is not tied to a specific filter. Once fluentbit starts writing these errors there is no stopping it; fluentbit constantly logs many such error lines.

[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm
[2023/08/02 10:31:45] [error] [input:emitter:emitter_for_multiline.4] error registering chunk with tag: kubelrc.srl-main-server.qa3.srl-main-server-deployment-123-lcgsm

I would guess this issue was introduced in 2.1.x.

To reproduce, I ran fluentbit/fluentd locally with multiline parser filters and many different types of mock components producing logs at a high rate.

I have several multiline parsers for different components, but they all look more or less like the one below. I assume, though, that any parser will do.

[INPUT]
    Name tail
    Path /var/log/containers/*srl-*.log
    Key log
    Refresh_Interval 10
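    # <container_name>, <namespace_name> and <pod_name> in Tag below are filled from the
    # named capture groups that Tag_Regex extracts from each monitored file path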
    Tag               kubelrc.<container_name>.<namespace_name>.<pod_name>
    Tag_Regex         (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    Parser cri


[FILTER]
    name                    multiline
    match                   kubelrc.srl-tdigest*.*
    #Must set the field that contains the message to parse
    multiline.key_content   log
    multiline.parser        multiline-tdigest


[MULTILINE_PARSER]
    name          multiline-tdigest
    type          regex
    # TDigest state machine
    # lines starting with dddd-dd-dd (we assume it's a timestamp) open a new record
    # any lines after that that do NOT match dddd-dd-dd will be considered part of the previous line
    # NOTE - this can cause a problem: for example, on a restart of tdigest, the first logs that have
    # no timestamp will be appended to the last timestamped log from before the restart
    # We could consider being more strict with the cont rule and specify what to expect after a new line
    # rules   |   state name   | regex pattern                   | next state name
    # --------|----------------|--------------------------------------------------
    rule         "start_state"   "/^(([0-9]{4}-[0-9]{2}-[0-9]{2})|IMAGE_TAG)/"  "cont"
    rule         "cont"          "/^(?!(([0-9]{4}-[0-9]{2}-[0-9]{2})|IMAGE_TAG))/"    "cont"
    #rule         "cont"          "/^(\s+at).*/" #Java exception only
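
To make the state machine concrete, here is a hypothetical illustration (log lines invented for this example, not taken from the report) of how the parser above would group input: a line beginning with a yyyy-mm-dd timestamp (or IMAGE_TAG) opens a new record, and every following line that does not begin with such a timestamp is appended to it.

    2023-08-02 10:31:45 ERROR request failed       <- matches start_state, opens a new record
    java.lang.NullPointerException                  <- no timestamp, appended (cont)
        at com.example.Foo.bar(Foo.java:42)         <- no timestamp, appended (cont)
    2023-08-02 10:31:46 INFO next request           <- matches start_state, starts the next record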

ryan65 avatar Aug 02 '23 10:08 ryan65

Hi, can someone please look into this? It's definitely an issue that requires attention, as it is a regression starting in 2.1.x (at least I know for sure it's present in 2.1.2 and continues in 2.1.8). Last week we returned to version 1.9.7 and everything is working fine, so the problem we had was definitely due to the upgrade.

ryan65 avatar Aug 06 '23 06:08 ryan65

New update: it seems the issue started after 2.0.9, because 2.0.9 does not have this problem and we know that 2.1.2 does. I know I am not the only one with this issue, as reported by others here: https://github.com/fluent/fluent-bit/issues/4940

ryan65 avatar Aug 17 '23 06:08 ryan65

That bit about setting buffer to off alleviating the issue, mentioned by terrencebor, sounds quite interesting; do you think you could give it a try? Not as a workaround suggestion, but to confirm that it's a lead. That would make it much easier for the team to address it, since we are quite limited time-wise.

Thank you for taking the time to report the issue and for the patience.

leonardo-albertovich avatar Aug 17 '23 12:08 leonardo-albertovich
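
For context, the buffer setting being discussed is an option of the multiline filter (buffered mode is enabled by default). A minimal sketch of disabling it on the filter from the original report, assuming the same match pattern and parser name, would be:

[FILTER]
    name                    multiline
    match                   kubelrc.srl-tdigest*.*
    multiline.key_content   log
    multiline.parser        multiline-tdigest
    # disable the filter's internal buffered mode (the default is on)
    buffer                  off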

> That bit about setting buffer to off alleviating the issue, mentioned by terrencebor, sounds quite interesting; do you think you could give it a try? Not as a workaround suggestion, but to confirm that it's a lead. That would make it much easier for the team to address it, since we are quite limited time-wise.
>
> Thank you for taking the time to report the issue and for the patience.

Hi Leonardo, actually I already checked that before, but just to make sure I checked it again right now, and adding "buffer off" to the multiline parsers does not solve the 100% CPU issue. The problem remains.

ryan65 avatar Aug 17 '23 13:08 ryan65

Same here. Now I have to revert to the old version; in my case I tried to upgrade from 2.0.6 to 2.1.9.

lllsJllsJ avatar Sep 20 '23 12:09 lllsJllsJ

Same.

adl3n avatar Oct 04 '23 11:10 adl3n

I have the same issue in 2.1.10 and have created a reproducible test case here: https://github.com/rhysm/flb-multiline-repro

I have also attached callgrind output from running fluent-bit for ~1 minute. The attachment contains callgrind output for the last working version, 2.0.14, and for the latest 2.1.10, which still appears to have this issue.

flb-7782-callgrind.tar.gz

rhysm avatar Nov 02 '23 03:11 rhysm
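
For anyone who wants to collect a similar profile: a typical callgrind run, assuming valgrind is installed and substituting your own binary and config paths, looks something like:

    valgrind --tool=callgrind /opt/fluent-bit/bin/fluent-bit -c fluent-bit.conf

This writes a callgrind.out.<pid> file that a viewer such as kcachegrind can open.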

Using the multiline parser as part of the input itself rather than on its own as a filter solved this issue for us: https://docs.fluentbit.io/manual/pipeline/inputs/tail#multiline-support

robmcelhinney avatar Nov 15 '23 15:11 robmcelhinney
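
For reference, a rough sketch of that approach, reusing the tail input and parser name from the original report (untested here): multiline.parser on the tail input takes a comma-separated list, so the built-in cri parser and the custom one can be combined there, replacing both "Parser cri" and the separate multiline FILTER.

[INPUT]
    Name              tail
    Path              /var/log/containers/*srl-*.log
    Refresh_Interval  10
    Tag               kubelrc.<container_name>.<namespace_name>.<pod_name>
    Tag_Regex         (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-
    # multiline parsing happens in the input itself instead of a FILTER
    multiline.parser  cri, multiline-tdigest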

Same issue here :(

alexku7 avatar Jan 15 '24 14:01 alexku7

It seems that 2.2.2 still has this 100% CPU issue (#4940). Hi @leonardo-albertovich, any chance someone can have a look at this regression? It seems pretty problematic, don't you think? It seems that it started after 2.0.9. Thanks.

ryan65 avatar Jan 18 '24 07:01 ryan65

2.0.9 works correctly. The issue started in the next version after 2.0.9.

alexku7 avatar Jan 18 '24 07:01 alexku7

2.2.0 has the same issue.

Park-jeong-seop avatar Jan 25 '24 01:01 Park-jeong-seop

Same for me on 2.2.0.

lbodor avatar Jan 25 '24 01:01 lbodor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar May 01 '24 01:05 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar May 07 '24 01:05 github-actions[bot]