
in_tail: release event loop between files (#3947)

Open pcolazurdo opened this pull request 4 years ago • 16 comments

This patch fixes a problem where in_tail blocks the event loop until all files have been processed. The problem becomes more obvious when the number of files the plugin is monitoring is large (>100).

Signed-off-by: Pablo E. Colazurdo [email protected]

This helps with issue #3947. The goal is to return from the function after each file is processed, so that the event loop invokes it again. Without this fix the function does not return until all files have been processed, which, when there are many files (>200) or the files are very long, causes files to get stuck and keeps the output plugins from being invoked until the function finishes.
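
For illustration, here is a minimal standalone sketch of the pattern (hypothetical names, and a plain while loop standing in for the real engine event loop, not the actual in_tail code): the collector callback handles a single file and returns, instead of looping over every monitored file before giving control back.

#include <stdio.h>

#define FILE_COUNT 5

struct tail_ctx {
    int next_file;      /* index of the next file to process */
    int total_files;
};

/* Old behaviour: process every file before returning to the event loop. */
static int collect_all(struct tail_ctx *ctx)
{
    for (int i = 0; i < ctx->total_files; i++) {
        printf("processing file %d (event loop blocked)\n", i);
    }
    return 0;
}

/* Patched behaviour: process one file, remember where we left off and
 * return, so the event loop can run output flushes in between. */
static int collect_one(struct tail_ctx *ctx)
{
    if (ctx->next_file >= ctx->total_files) {
        ctx->next_file = 0;   /* start over on the next cycle */
        return 0;             /* nothing pending */
    }
    printf("processing file %d, yielding back to the event loop\n",
           ctx->next_file);
    ctx->next_file++;
    return 1;                 /* more work pending, invoke again */
}

int main(void)
{
    struct tail_ctx ctx = { .next_file = 0, .total_files = FILE_COUNT };

    /* Before the fix: one call blocks until every file is done. */
    collect_all(&ctx);

    /* After the fix: simulated event loop, other work can run in between. */
    while (collect_one(&ctx)) {
        printf("event loop free: output plugins can flush here\n");
    }
    return 0;
}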

This has been tested using tools/cio with 100, 1000, and 2000 files, and with 1000 and 10000 records:

tools/cio -p "tests/data/1.5kb.txt" -v -e 1000 -w 1000
tools/cio -p "tests/data/1.5kb.txt" -v -e 1000 -w 10000
tools/cio -p "tests/data/1.5kb.txt" -v -e 2000 -w 10000

Enter [N/A] in the box if an item is not applicable to your change.

Testing

Before we can approve your change, please submit the following in a comment:

  • [N/A] Example configuration file for the change
  • [N/A] Debug log output from testing the change
  • [N/A] Attached Valgrind output that shows no leaks or memory corruption was found

Documentation

  • [N/A] Documentation required for this feature

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

pcolazurdo avatar Oct 20 '21 10:10 pcolazurdo

I can't add the Valgrind checks because they show leaks even without the proposed fix. I'm not sure how to submit checks for only this part, or whether that would help.

pcolazurdo avatar Oct 20 '21 10:10 pcolazurdo

@pcolazurdo I'm not super worried about Valgrind: since this change introduces no new allocations and is very small, I think we can be decently confident it doesn't add any new leaks.

Submitting Valgrind output that shows the existing leaks would be fine though; the key is showing that your change doesn't introduce any net new leaks.
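
For reference, one way to make that comparison is to run the same workload once on master and once with the patch, then diff the leak summaries. This assumes a local CMake build under build/ and a test configuration named tail.conf; both are illustrative placeholders:

valgrind --leak-check=full --log-file=valgrind-baseline.txt build/bin/fluent-bit -c tail.conf
valgrind --leak-check=full --log-file=valgrind-patched.txt build/bin/fluent-bit -c tail.conf

If the LEAK SUMMARY sections of the two logs match, the patch adds no net new leaks.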

PettitWesley avatar Oct 20 '21 18:10 PettitWesley

Also, a small thing on the commit message: it's supposed to be in_tail, with an underscore rather than a dash.

PettitWesley avatar Oct 20 '21 18:10 PettitWesley

@pcolazurdo Could you check the DCO error? https://probot.github.io/apps/dco/

nokute78 avatar Oct 22 '21 22:10 nokute78

> @pcolazurdo Could you check the DCO error? probot.github.io/apps/dco

Fixed - thanks

pcolazurdo avatar Oct 24 '21 08:10 pcolazurdo

@pcolazurdo Thank you for fixing the DCO error.

Could you also check the commit message format? e.g. in_tail: something (see https://github.com/fluent/fluent-bit/blob/master/CONTRIBUTING.md#commit-changes).

nokute78 avatar Oct 30 '21 08:10 nokute78

One issue I found with this fix: when I use the standard golang output plugin with "workers 2" in the output configuration, some data goes missing. The count from curl -s http://127.0.0.1:2020/api/v1/metrics | jq '.' is smaller than the actual number of entries in the data directory. I also tried sending the output to stdout at the same time, and some data entries were missing from the stdout output as well. Without the "workers" parameter, this fix works well.

wubn2000 avatar Nov 01 '21 15:11 wubn2000

@wubn2000 Does that only happen with this fix or without it as well?

Unless we did something wrong this change shouldn't cause log loss...

PettitWesley avatar Nov 01 '21 17:11 PettitWesley

It happens with this fix. But I found that if I raise storage.backlog.mem_limit and mem_buf_limit to 4096MB, I don't see the data loss for my test data set (previously I used 500MB, which always gave me data loss on my test data: 1000 files with 9.3M messages in total, each message about 1.5K).
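
For context, a minimal sketch of the kind of configuration being described, in classic fluent-bit format; the plugin name, paths, and limits are illustrative placeholders, not the exact setup used here:

[SERVICE]
    storage.path              /var/log/flb-storage/
    storage.backlog.mem_limit 4096MB

[INPUT]
    Name          tail
    Path          /data/logs/*.log
    mem_buf_limit 4096MB

[OUTPUT]
    # my_go_output stands in for the external golang output plugin
    Name    my_go_output
    Match   *
    Workers 2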

wubn2000 avatar Nov 01 '21 18:11 wubn2000

I've tried these settings with and without the fix, and the data loss happens in both cases. It looks like when mem_buf_limit is low there are situations where data is lost. I'll open a new issue for this particular case, as it doesn't seem to be related to this fix.

pcolazurdo avatar Nov 02 '21 15:11 pcolazurdo

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Dec 03 '21 01:12 github-actions[bot]

@pcolazurdo Sorry, I forgot to follow up on this for a while. Have we gotten feedback on whether or not this met the original need?

PettitWesley avatar Dec 03 '21 02:12 PettitWesley

@PettitWesley Yes, the feedback was very positive. In tests with 2000 files and thousands of log lines per second, catch-up time dropped from multiple hours (2+) to just 2 minutes. The tests were done by porting the fix to 1.8.2, due to other issues in 1.8.9 such as: https://github.com/fluent/fluent-bit/issues/4273

pcolazurdo avatar Dec 07 '21 12:12 pcolazurdo

@edsiper I remember you said you wanted to wait for the testing results to merge this one, but now that they have come back positive, can we proceed with this?

PettitWesley avatar Dec 13 '21 16:12 PettitWesley

We got an internal report that this specific approach would not work in the latest 1.8.x series (not sure why).

We are submitting another PR that proposes a similar solution but is based on "bytes consumed".

edsiper avatar Jan 08 '22 02:01 edsiper

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Apr 09 '22 02:04 github-actions[bot]

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Nov 26 '24 02:11 github-actions[bot]

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Aug 27 '25 02:08 github-actions[bot]