fluent-bit event_loop: Check loop condition before removing event from bucket queue

Fluent Bit sometimes doesn't resume coroutines for timed out network events. This can also cause deadlocks in some cases, like AWS credentials provider refresh function, if the control is not returned to the function that sets a lock before a network call.

This is caused by the priority event loop handling of network timeout events. That kind of event is injected into the loop and then added to the bucket queue, but only once. If the event is not handled after being removed from the bucket queue, such event is not re-added to the queue, and not processed on loop re-entry. This situation happens because the event is first removed from the bucket queue, and then the loop condition is checked.

Changed the order of statements in priority queue loop macro: we now check if the loop has reached the iteration limit before removing the event from the bucket queue.
Added a test for event priority loop. It validates that when the event loop finishes all iterations, and the injected event is next in the priority queue, then this event will be processed next time the priority event loop runs.

Signed-off-by: zhenyami [email protected]

Fixes https://github.com/fluent/fluent-bit/issues/5553

Related changes

https://github.com/fluent/fluent-bit/pull/4869
https://github.com/fluent/fluent-bit/pull/4624

Testing Before we can approve your change; please submit the following in a comment:

[N/A] Example configuration file for the change
[ ] Debug log output from testing the change

[x] Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

[N/A] Attached local packaging test output showing all targets (including any new ones) build.

Documentation

[N/A] Documentation required for this feature

Backporting

[ ] Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Jun 27 '22 19:06 zhenyami

cc: @matthewfala

Jun 27 '22 21:06 edsiper

is this ready to go ?

Jul 11 '22 23:07 edsiper

Additional changes

I reformatted code to be below 90 characters line length limit. I also added a test to validate injected event handling in priority event loop.

This test fails without the fix to the event loop order:

$ ./bin/flb-it-flb_event_loop
Test test_simple_timeout_1000ms...              [ OK ]
Test test_non_blocking_and_blocking_timeout...  [ OK ]
Test test_infinite_wait...                      [ OK ]
Test event_loop_stress_priority_add_delete...   [ OK ]
Test test_inject_event_priority_loop...         [ FAILED ]
  flb_event_loop.c:633: Check event->priority == 1... failed
    Expected injected event with priority 1, instead got event with priority 2
FAILED: 1 of 5 unit tests has failed.

Test passes with the fix applied:

$ ./bin/flb-it-flb_event_loop
Test test_simple_timeout_1000ms...              [ OK ]
Test test_non_blocking_and_blocking_timeout...  [ OK ]
Test test_infinite_wait...                      [ OK ]
Test event_loop_stress_priority_add_delete...   [ OK ]
Test test_inject_event_priority_loop...         [ OK ]
SUCCESS: All unit tests have passed.

Jul 13 '22 20:07 zhenyami

Valgrind output

I ran Valgrind with the new test.

$ valgrind ./bin/flb-it-flb_event_loop
...
Test test_inject_event_priority_loop...         
==2428== Conditional jump or move depends on uninitialised value(s)
==2428==    at 0x40B2CF: flb_event_load_bucket_queue_event (flb_event_loop.h:31)
==2428==    by 0x40B458: flb_event_load_injected_events (flb_event_loop.h:67)
==2428==    by 0x40D2AC: test_inject_event_priority_loop (flb_event_loop.c:614)
==2428==    by 0x4096D8: test_do_run_ (acutest.h:1007)
==2428==    by 0x40994D: test_run_ (acutest.h:1103)
==2428==    by 0x40AC35: main (acutest.h:1700)
==2428==
==2428== Conditional jump or move depends on uninitialised value(s)
==2428==    at 0x40F25D: flb_pipe_close (flb_pipe.c:103)
==2428==    by 0x40B67D: test_timeout_destroy (flb_event_loop.c:106)
==2428==    by 0x40D3F5: test_inject_event_priority_loop (flb_event_loop.c:640)
==2428==    by 0x4096D8: test_do_run_ (acutest.h:1007)
==2428==    by 0x40994D: test_run_ (acutest.h:1103)
==2428==    by 0x40AC35: main (acutest.h:1700)
==2428==
==2428== Syscall param close(fd) contains uninitialised byte(s)
==2428==    at 0x4E4278D: ??? (in /lib64/libpthread-2.17.so)
==2428==    by 0x40F26F: flb_pipe_close (flb_pipe.c:107)
==2428==    by 0x40B67D: test_timeout_destroy (flb_event_loop.c:106)
==2428==    by 0x40D3F5: test_inject_event_priority_loop (flb_event_loop.c:640)
==2428==    by 0x4096D8: test_do_run_ (acutest.h:1007)
==2428==    by 0x40994D: test_run_ (acutest.h:1103)
==2428==    by 0x40AC35: main (acutest.h:1700)
==2428==
==2428==
==2428== HEAP SUMMARY:
==2428==     in use at exit: 3,728 bytes in 7 blocks
==2428==   total heap usage: 23 allocs, 16 frees, 10,608 bytes allocated
==2428==
==2428== LEAK SUMMARY:
==2428==    definitely lost: 0 bytes in 0 blocks
==2428==    indirectly lost: 0 bytes in 0 blocks
==2428==      possibly lost: 3,648 bytes in 6 blocks
==2428==    still reachable: 80 bytes in 1 blocks
==2428==         suppressed: 0 bytes in 0 blocks
==2428== Rerun with --leak-check=full to see details of leaked memory

Jul 13 '22 21:07 zhenyami

Ready for review. I don't know what to do about the MacOS test failure. I see some build issues are being worked on in other PRs.

Jul 18 '22 15:07 zhenyami

@zhenyami, this change is significant! It's clear you put effort into understanding the event loop system and making appropriate changes.

@edsiper It seems that this is a legitimate problem with the priority event loop that will cause injected events to be dropped in edge cases. Would it be possible prioritizing merging this into Fluent Bit?

The test case looks good.

3 priority 0 events are added, with 1 priority 1 event injected {0,0,0,1}
A loop of max 3 iterations loops over the 3 priority 0 events {0,0,0} (potentially drops event 3+1=4, the injected event)
3 priority 2 events are added {1,2,2,2}
A loop of max 4 iterations loops over the events {1,2,2,2} checking priorities are handled in the correct order.

Aug 15 '22 17:08 matthewfala

@leonardo-albertovich Is there any progress on reviewing this PR. Seems like an important fix to a bug in the event loop. Not sure how widespread issues that may arise might be, but I would think at some point people will complain about fluent bit freezing up and that would be linked back to this PR's highlighted issue.

Sep 06 '22 22:09 matthewfala

I haven't but if it others have that's OK by me.

Sep 07 '22 08:09 leonardo-albertovich

folks, do we have 100% confidence on this PR and tests done ?

I see the unit test here:

https://github.com/fluent/fluent-bit/pull/5649/files#diff-aa41e6d7b7569a427563d189aa551f0563a8d2b3e3f25ea48b556150f308a785R582

but I want to make sure the unit test reproduces the original issue

Sep 09 '22 04:09 edsiper

@matthewfala and @PettitWesley can you confirm if all required tests are done and we have full confidence on this fix.

Sep 09 '22 06:09 lubingfeng

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

Dec 09 '22 02:12 github-actions[bot]

up

Dec 16 '22 14:12 yackushevas

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

Mar 17 '23 01:03 github-actions[bot]

@matthewfala do you think this is still relevant? Do you think we could wrap it up?

Mar 17 '23 17:03 leonardo-albertovich

@leonardo-albertovich, A while back, I looked into this deeply and believe that the tests are sufficient to cause the bug and show that the solution works.

I'm confident in the presented solution. However in the future we can refactor with the suggested change to make it slightly more readable.

Mar 29 '23 23:03 matthewfala

@matthewfala

Fluent Bit sometimes doesn't resume coroutines for timed out network events. This can also cause deadlocks in some cases, like AWS credentials provider refresh function, if the control is not returned to the function that sets a lock before a network call.

The issue description kind of sounds like the mk_event_inject bug that we found right? https://github.com/fluent/fluent-bit/issues/6822

Mar 30 '23 00:03 PettitWesley

I'm out of the loop so I trust your judgment @matthewfala.

As for the bug mentioned by @PettitWesley, I don't think it's related, of course, the priority event loop is the root of both but that's as far as it goes from what I see. I guess there's a lesson to be learned in it.

Mar 30 '23 06:03 leonardo-albertovich

@PettitWesley I believe these bugs are not related. I think this code is ready to be merged.

Mar 31 '23 19:03 matthewfala

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

Jun 30 '23 02:06 github-actions[bot]