in_tail: release event loop between files (#3947)
This patch fixes a problem where in_tail blocks the event loop until all files have been processed. The problem becomes more noticeable when the number of files the plugin is monitoring is large (>100).
Signed-off-by: Pablo E. Colazurdo [email protected]
This helps with issue #3947. The goal is to return from the function after each file is processed; the event loop then invokes it again. Without this fix the function does not return until all files have been processed, which, when there are many files (>200) or the files are very large, causes files to get stuck and the output plugins are not invoked until the function finishes.
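For readers skimming the thread, here is a minimal, self-contained sketch of the pattern the patch moves toward (this is not the in_tail code; all names and structures are illustrative): the collector handles one file per invocation and returns to the event loop, instead of draining every file in a single call.

/* Illustrative sketch only, not Fluent Bit code: a collector that handles
 * ONE pending item per invocation and then yields back to the event loop,
 * instead of processing every item in a single call. */
#include <stdio.h>

#define NUM_FILES 5

static int next_file = 0;                 /* index of the next pending "file" */

/* Collector callback: process at most one file, then return. */
static int collect_one(void)
{
    if (next_file >= NUM_FILES) {
        return 0;                         /* nothing left to do */
    }
    printf("processing file %d\n", next_file);
    next_file++;
    return 1;                             /* more work remains; re-invoke later */
}

int main(void)
{
    /* Toy "event loop": between collector invocations, other callbacks
     * (for example output flushes) get a chance to run. */
    while (collect_one()) {
        printf("  (event loop runs other callbacks here)\n");
    }
    return 0;
}

With this shape, output callbacks are scheduled between files rather than only after the whole set has been processed.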
This has been tested using tools/cio with 100, 1000, and 2000 files and with 1000 and 10000 records:
tools/cio -p "tests/data/1.5kb.txt" -v -e 1000 -w 1000
tools/cio -p "tests/data/1.5kb.txt" -v -e 1000 -w 10000
tools/cio -p "tests/data/1.5kb.txt" -v -e 2000 -w 10000
Enter [N/A] in the box if an item is not applicable to your change.
Testing
Before we can approve your change, please submit the following in a comment:
- [N/A] Example configuration file for the change
- [N/A] Debug log output from testing the change
- [N/A] Attached Valgrind output that shows no leaks or memory corruption was found
Documentation
- [N/A] Documentation required for this feature
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
I can't add the Valgrind checks because they show leaks even without the proposed fix. I'm not sure how to submit the checks for only this part, or whether that would help.
@pcolazurdo I'm not super worried about Valgrind. Since this change introduces no new allocations and is very small, I think we can be reasonably confident it doesn't add any new leaks.
Submitting Valgrind output that shows existing leaks would be fine, though; the key is showing that your change doesn't introduce any net new leaks.
Also, a small thing on the commit message: it's supposed to be in_tail, underscore not dash.
@pcolazurdo Could you check the DCO error? https://probot.github.io/apps/dco/
@pcolazurdo Thank you for fixing the DCO error.
Could you also check the commit message?
e.g. in_tail: something
https://github.com/fluent/fluent-bit/blob/master/CONTRIBUTING.md#commit-changes
One issue I found with this fix: when I use the standard golang output plugin with "workers 2" in the output configuration, some data goes missing. The count from "curl -s http://127.0.0.1:2020/api/v1/metrics | jq '.'" is smaller than the actual number of entries in the data directory. I also tried sending the output to stdout at the same time and found some data entries missing in the stdout output as well. Without the "workers" parameter, this fix works well.
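For context, the output section being described looks roughly like the following; the plugin name is a placeholder, not the actual plugin in use:

[OUTPUT]
    Name     my_go_output    # placeholder for the Go output plugin in use
    Match    *
    Workers  2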
@wubn2000 Does that only happen with this fix or without it as well?
Unless we did something wrong this change shouldn't cause log loss...
It happens with this fix. But I found that if I change storage.backlog.mem_limit and mem_buf_limit to 4096MB (previously I used 500MB, which always gave me data loss on my test set of 1000 files, about 9.3M messages in total, each around 1.5K), I don't see the data loss for my test data set.
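For reference, the settings being described map to roughly this kind of configuration; the input path is illustrative:

[SERVICE]
    storage.backlog.mem_limit  4096MB

[INPUT]
    Name           tail
    Path           /data/*.log        # illustrative path
    Mem_Buf_Limit  4096MB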
I've tried these settings with and without the fix, and the data loss happens in both cases. It looks like there are situations where data is lost when mem_buf_limit is low. I'll open a new issue for this particular case, as it seems unrelated to this fix.
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.
@pcolazurdo Sorry, I forgot to follow up on this for a while. Have we gotten feedback on whether or not this met the original need?
@PettitWesley Yes, feedback was very positive. Tests with 2000 files and thousands of log lines per second reduced catch-up time from multiple hours (2+) to just 2 minutes. The tests were done by porting the fix to 1.8.2 due to other issues in 1.8.9, such as: https://github.com/fluent/fluent-bit/issues/4273
@edsiper I remember you said you wanted to wait for the testing results to merge this one, but now that they have come back positive, can we proceed with this?
We got an internal report that this specific approach would not work in the latest 1.8.x series (not sure why).
We are submitting another PR that proposes a similar solution but is based on "bytes consumed".
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.