in_tail: release event loop between files (#3947)
This patch fixes a problem where in_tail blocks the event loop until all files have been processed. The problem becomes more noticeable when the number of files the plugin is monitoring is large (>100).
Signed-off-by: Pablo E. Colazurdo [email protected]
This helps with issue #3947. The goal is to return from the function after each file is processed; the event loop then invokes it again. Without this fix the function does not return until all files have been processed, which, when there are many files (>200) or the files are very large, causes files to get stuck and the output plugins are not invoked until the function finishes.
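For readers skimming the thread, here is a minimal, self-contained sketch of the pattern the patch moves toward (this is not the in_tail code; all names and structures are illustrative): the collector handles one file per invocation and returns to the event loop, instead of draining every file in a single call.

/* Illustrative sketch only, not Fluent Bit code: a collector that handles
 * ONE pending item per invocation and then yields back to the event loop,
 * instead of processing every item in a single call. */
#include <stdio.h>

#define NUM_FILES 5

static int next_file = 0;                 /* index of the next pending "file" */

/* Collector callback: process at most one file, then return. */
static int collect_one(void)
{
    if (next_file >= NUM_FILES) {
        return 0;                         /* nothing left to do */
    }
    printf("processing file %d\n", next_file);
    next_file++;
    return 1;                             /* more work remains; re-invoke later */
}

int main(void)
{
    /* Toy "event loop": between collector invocations, other callbacks
     * (for example output flushes) get a chance to run. */
    while (collect_one()) {
        printf("  (event loop runs other callbacks here)\n");
    }
    return 0;
}

With this shape, output callbacks are scheduled between files rather than only after the whole set has been processed.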
This has been tested using tools/cio with 100, 1000, and 2000 files and with 1000 and 10000 records:
tools/cio -p "tests/data/1.5kb.txt" -v -e 1000 -w 1000
tools/cio -p "tests/data/1.5kb.txt" -v -e 1000 -w 10000
tools/cio -p "tests/data/1.5kb.txt" -v -e 2000 -w 10000
Enter [N/A] in the box if an item is not applicable to your change.
Testing
Before we can approve your change, please submit the following in a comment:
- [N/A] Example configuration file for the change
- [N/A] Debug log output from testing the change
- [N/A] Attached Valgrind output that shows no leaks or memory corruption was found
Documentation
- [N/A] Documentation required for this feature
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
I can't add the Valgrind checks because they show leaks even without the proposed fix. I'm not sure how to submit the checks for only this part, or whether that would help.
@pcolazurdo I'm not super worried about Valgrind. Since this change introduces no new allocations and is very small, I think we can be reasonably confident it doesn't add any new leaks.
Submitting Valgrind output that shows existing leaks would be fine, though; the key is showing that your change doesn't introduce any net new leaks.
Also, a small thing on the commit message: it's supposed to be in_tail, underscore not dash.
@pcolazurdo Could you check the DCO error? https://probot.github.io/apps/dco/
@pcolazurdo Thank you for fixing the DCO error.
Could you also check the commit message?
e.g. in_tail: something
https://github.com/fluent/fluent-bit/blob/master/CONTRIBUTING.md#commit-changes
One issue I found with this fix: when I use the standard golang output plugin with "workers 2" in the output configuration, some data goes missing. The count from "curl -s http://127.0.0.1:2020/api/v1/metrics | jq '.'" is smaller than the actual number of entries in the data directory. I also tried sending the output to stdout at the same time and found some data entries missing in the stdout output as well. Without the "workers" parameter, this fix works well.
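For context, the output section being described looks roughly like the following; the plugin name is a placeholder, not the actual plugin in use:

[OUTPUT]
    Name     my_go_output    # placeholder for the Go output plugin in use
    Match    *
    Workers  2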
@wubn2000 Does that only happen with this fix or without it as well?
Unless we did something wrong this change shouldn't cause log loss...
It happens with this fix. But I found that if I change storage.backlog.mem_limit and mem_buf_limit to 4096MB (previously I used 500MB, which always gave me data loss on my test set of 1000 files, about 9.3M messages in total, each around 1.5K), I don't see the data loss for my test data set.
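For reference, the settings being described map to roughly this kind of configuration; the input path is illustrative:

[SERVICE]
    storage.backlog.mem_limit  4096MB

[INPUT]
    Name           tail
    Path           /data/*.log        # illustrative path
    Mem_Buf_Limit  4096MB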
I've tried these settings with and without the fix, and the data loss happens in both cases. It looks like there are situations where data is lost when mem_buf_limit is low. I'll open a new issue for this particular case, as it seems unrelated to this fix.
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.
@pcolazurdo Sorry, I forgot to follow up on this for a while. Have we gotten feedback on whether or not this met the original need?
@PettitWesley Yes, feedback was very positive. Tests with 2000 files and thousands of log lines per second reduced catch-up time from multiple hours (2+) to just 2 minutes. The tests were done by porting the fix to 1.8.2 due to other issues in 1.8.9, such as: https://github.com/fluent/fluent-bit/issues/4273
@edsiper I remember you said you wanted to wait for the testing results to merge this one, but now that they have come back positive, can we proceed with this?
We got an internal report that this specific approach would not work in the latest 1.8.x series (not sure why).
We are submitting another PR that proposes a similar solution but is based on "bytes consumed".
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.