fluent-bit
fluent-bit copied to clipboard
event_loop: Check loop condition before removing event from bucket queue
Fluent Bit sometimes doesn't resume coroutines for timed out network events. This can also cause deadlocks in some cases, like AWS credentials provider refresh function, if the control is not returned to the function that sets a lock before a network call.
This is caused by the priority event loop handling of network timeout events. That kind of event is injected into the loop and then added to the bucket queue, but only once. If the event is not handled after being removed from the bucket queue, such event is not re-added to the queue, and not processed on loop re-entry. This situation happens because the event is first removed from the bucket queue, and then the loop condition is checked.
- Changed the order of statements in priority queue loop macro: we now check if the loop has reached the iteration limit before removing the event from the bucket queue.
- Added a test for event priority loop. It validates that when the event loop finishes all iterations, and the injected event is next in the priority queue, then this event will be processed next time the priority event loop runs.
Signed-off-by: zhenyami [email protected]
Fixes https://github.com/fluent/fluent-bit/issues/5553
Related changes
- https://github.com/fluent/fluent-bit/pull/4869
- https://github.com/fluent/fluent-bit/pull/4624
Testing Before we can approve your change; please submit the following in a comment:
- [N/A] Example configuration file for the change
- [ ] Debug log output from testing the change
- [x] Attached Valgrind output that shows no leaks or memory corruption was found
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
- [N/A] Attached local packaging test output showing all targets (including any new ones) build.
Documentation
- [N/A] Documentation required for this feature
Backporting
- [ ] Backport to latest stable release.
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
cc: @matthewfala
is this ready to go ?
Additional changes
I reformatted code to be below 90 characters line length limit. I also added a test to validate injected event handling in priority event loop.
This test fails without the fix to the event loop order:
$ ./bin/flb-it-flb_event_loop
Test test_simple_timeout_1000ms... [ OK ]
Test test_non_blocking_and_blocking_timeout... [ OK ]
Test test_infinite_wait... [ OK ]
Test event_loop_stress_priority_add_delete... [ OK ]
Test test_inject_event_priority_loop... [ FAILED ]
flb_event_loop.c:633: Check event->priority == 1... failed
Expected injected event with priority 1, instead got event with priority 2
FAILED: 1 of 5 unit tests has failed.
Test passes with the fix applied:
$ ./bin/flb-it-flb_event_loop
Test test_simple_timeout_1000ms... [ OK ]
Test test_non_blocking_and_blocking_timeout... [ OK ]
Test test_infinite_wait... [ OK ]
Test event_loop_stress_priority_add_delete... [ OK ]
Test test_inject_event_priority_loop... [ OK ]
SUCCESS: All unit tests have passed.
Valgrind output
I ran Valgrind with the new test.
$ valgrind ./bin/flb-it-flb_event_loop
...
Test test_inject_event_priority_loop...
==2428== Conditional jump or move depends on uninitialised value(s)
==2428== at 0x40B2CF: flb_event_load_bucket_queue_event (flb_event_loop.h:31)
==2428== by 0x40B458: flb_event_load_injected_events (flb_event_loop.h:67)
==2428== by 0x40D2AC: test_inject_event_priority_loop (flb_event_loop.c:614)
==2428== by 0x4096D8: test_do_run_ (acutest.h:1007)
==2428== by 0x40994D: test_run_ (acutest.h:1103)
==2428== by 0x40AC35: main (acutest.h:1700)
==2428==
==2428== Conditional jump or move depends on uninitialised value(s)
==2428== at 0x40F25D: flb_pipe_close (flb_pipe.c:103)
==2428== by 0x40B67D: test_timeout_destroy (flb_event_loop.c:106)
==2428== by 0x40D3F5: test_inject_event_priority_loop (flb_event_loop.c:640)
==2428== by 0x4096D8: test_do_run_ (acutest.h:1007)
==2428== by 0x40994D: test_run_ (acutest.h:1103)
==2428== by 0x40AC35: main (acutest.h:1700)
==2428==
==2428== Syscall param close(fd) contains uninitialised byte(s)
==2428== at 0x4E4278D: ??? (in /lib64/libpthread-2.17.so)
==2428== by 0x40F26F: flb_pipe_close (flb_pipe.c:107)
==2428== by 0x40B67D: test_timeout_destroy (flb_event_loop.c:106)
==2428== by 0x40D3F5: test_inject_event_priority_loop (flb_event_loop.c:640)
==2428== by 0x4096D8: test_do_run_ (acutest.h:1007)
==2428== by 0x40994D: test_run_ (acutest.h:1103)
==2428== by 0x40AC35: main (acutest.h:1700)
==2428==
==2428==
==2428== HEAP SUMMARY:
==2428== in use at exit: 3,728 bytes in 7 blocks
==2428== total heap usage: 23 allocs, 16 frees, 10,608 bytes allocated
==2428==
==2428== LEAK SUMMARY:
==2428== definitely lost: 0 bytes in 0 blocks
==2428== indirectly lost: 0 bytes in 0 blocks
==2428== possibly lost: 3,648 bytes in 6 blocks
==2428== still reachable: 80 bytes in 1 blocks
==2428== suppressed: 0 bytes in 0 blocks
==2428== Rerun with --leak-check=full to see details of leaked memory
Ready for review. I don't know what to do about the MacOS test failure. I see some build issues are being worked on in other PRs.
@zhenyami, this change is significant! It's clear you put effort into understanding the event loop system and making appropriate changes.
@edsiper It seems that this is a legitimate problem with the priority event loop that will cause injected events to be dropped in edge cases. Would it be possible prioritizing merging this into Fluent Bit?
The test case looks good.
- 3 priority 0 events are added, with 1 priority 1 event injected {0,0,0,1}
- A loop of max 3 iterations loops over the 3 priority 0 events {0,0,0} (potentially drops event 3+1=4, the injected event)
- 3 priority 2 events are added {1,2,2,2}
- A loop of max 4 iterations loops over the events {1,2,2,2} checking priorities are handled in the correct order.
@leonardo-albertovich Is there any progress on reviewing this PR. Seems like an important fix to a bug in the event loop. Not sure how widespread issues that may arise might be, but I would think at some point people will complain about fluent bit freezing up and that would be linked back to this PR's highlighted issue.
I haven't but if it others have that's OK by me.
folks, do we have 100% confidence on this PR and tests done ?
I see the unit test here:
https://github.com/fluent/fluent-bit/pull/5649/files#diff-aa41e6d7b7569a427563d189aa551f0563a8d2b3e3f25ea48b556150f308a785R582
but I want to make sure the unit test reproduces the original issue
@matthewfala and @PettitWesley can you confirm if all required tests are done and we have full confidence on this fix.
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.
up
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.
@matthewfala do you think this is still relevant? Do you think we could wrap it up?
@leonardo-albertovich, A while back, I looked into this deeply and believe that the tests are sufficient to cause the bug and show that the solution works.
I'm confident in the presented solution. However in the future we can refactor with the suggested change to make it slightly more readable.
@matthewfala
Fluent Bit sometimes doesn't resume coroutines for timed out network events. This can also cause deadlocks in some cases, like AWS credentials provider refresh function, if the control is not returned to the function that sets a lock before a network call.
The issue description kind of sounds like the mk_event_inject bug that we found right? https://github.com/fluent/fluent-bit/issues/6822
I'm out of the loop so I trust your judgment @matthewfala.
As for the bug mentioned by @PettitWesley, I don't think it's related, of course, the priority event loop is the root of both but that's as far as it goes from what I see. I guess there's a lesson to be learned in it.
@PettitWesley I believe these bugs are not related. I think this code is ready to be merged.
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.