fluent-bit icon indicating copy to clipboard operation
fluent-bit copied to clipboard

Unit tests intermittent failures/timeouts

Open patrick-stephens opened this issue 5 months ago • 5 comments

Bug Report

Recently I have noticed flakes on various PR unit tests unrelated to the actual changes, e.g.

  • https://github.com/fluent/fluent-bit/actions/runs/15389866212/attempts/1?pr=10418
  • https://github.com/fluent/fluent-bit/actions/runs/15408387192/attempts/2?pr=10426

The problem seems to be one of the tests is stuck so eventually the job is cancelled by the 1 hour timeout we have in place.

I have noticed this across various PRs but it does not seem to be within the same combination of options - in the two examples above one is for SIMD=off and the other is using the memory sanitiser.

patrick-stephens avatar Jun 04 '25 10:06 patrick-stephens

@pwhelan is there a way we can see which one is stuck/still running in CI? We can trigger that then on cancellation potentially.

patrick-stephens avatar Jun 04 '25 10:06 patrick-stephens

This test got cancelled for unknown reasons: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)

154/157 Test #157: flb-it-lua .......................................   Passed    0.01 sec
155/157 Test #144: flb-it-stream_processor ..........................   Passed   17.62 sec
156/157 Test  #71: flb-rt-out_stackdriver ...........................   Passed  300.00 sec
Error: The operation was canceled.

The last test must have timed out. I managed to find that the test flb-rt-processor_metrics_selector did not run.

I was able to find the test using the following snippet in fish on a log full of the test results:

for test in $(grep "Start " tests.txt | awk -F ' ' '{print $3}'); do
    if ! grep $test tests.txt | grep "Passed\|Failed" 1> /dev/null; then
        echo "TEST TIMEOUTD: $test"
    fi
done

In another run I see it took 16 seconds:

        Start  78: flb-rt-processor_metrics_selector
 75/157 Test  #75: flb-rt-out_file ..................................   Passed   26.00 sec
 76/157 Test  #77: flb-rt-out_influxdb ..............................   Passed   18.00 sec
        Start  79: flb-rt-processor_labels
        Start  80: flb-rt-processor_content_modifier
 77/157 Test  #79: flb-rt-processor_labels ..........................   Passed   10.00 sec
        Start  81: custom_calyptia.sh
 78/157 Test  #78: flb-rt-processor_metrics_selector ................   Passed   16.00 sec

That might have been just enough to push it outside of the time frame for the test, but in any case if we were to fix this I think tackling flb-rt-out_stackdriver is the best bet. It takes way too long because it has too many test cases which all run serially.

pwhelan avatar Jun 04 '25 14:06 pwhelan

I thought normally they don't seem to hit the timeout so that's why I thought it was stuck somewhere. Eventually we do timeout but my thinking was we were just stuck, it doesn't seem consistent either - was it the same test in both?

patrick-stephens avatar Jun 05 '25 18:06 patrick-stephens

It is actually flb-rt-out_stackdriver just taking a long time. It has about ~100 tests which all must have calls to sleep, etc... so it will take at least about 1 and a half minutes to execute at the moment.

We could rewrite all the tests as a single test and parallelize it from there, or group several tests into one.

I am working on a PoC which can serialize acutests but it is not even close to done yet.

pwhelan avatar Jun 13 '25 18:06 pwhelan

Hi,

I noticed that tests fail when executed in parallel, e.g. using:

ctest -j 32

My CMake configuration:

-G "Unix Makefiles"
-DCMAKE_BUILD_TYPE=Debug
-DFLB_TESTS_INTERNAL=On
-DFLB_TESTS_RUNTIME=On
-DFLB_WITHOUT_flb-rt-out_td=1
-DFLB_WITHOUT_flb-rt-out_tcp=1
-DFLB_WITHOUT_flb-rt-in_tcp=1
-DFLB_JEMALLOC=On
-DFLB_DEBUG=On

It looks like some tests run infinitely and then respective job is cancelled due to timeout. If I look at the links provided by @patrick-stephens in description of this issue and then navigate to the failed (cancelled due to timeout) job like https://github.com/fluent/fluent-bit/actions/runs/15124987740/job/42515298979, then I can see these infinitely running tests under "Complete job" step:

Cleaning up orphan processes
Terminate orphan process: pid (2959) (run-unit-tests.sh)
Terminate orphan process: pid (17155) (ctest)
Terminate orphan process: pid (19310) (flb-rt-processor_labels)
Terminate orphan process: pid (19318) (flb-rt-processor_labels)

The oldest commit where I was able to reproduce such behavior of tests is 5798ea426d91005b01df5147ac4f239079db365c and the latest commit where I was not able to reproduce this issue is cf553f782f5c0d67ece9ad53cdec844d07cbf7bf, so I suspect this issue started from (was introduced by) https://github.com/fluent/fluent-bit/pull/9952 (refer to failed tests at CI for this PR - https://github.com/fluent/fluent-bit/actions/runs/15124987740/job/42515298950).

Screenshots of debugger attached to infinitely running test:

Image

Image

Thank you.

mabrarov avatar Jun 13 '25 20:06 mabrarov

I have not had the time to checkout the impact of #9952. It could possibly be exasperating things though by allowing tests to run longer due to the grace period.

pwhelan avatar Jul 01 '25 15:07 pwhelan

Hi @pwhelan,

It seems that https://github.com/fluent/fluent-bit/pull/9952 makes some tests running not longer, but infinitely. Due to this issue happens only when tests are executed in parallel, I suspect some race condition (in engine) and/or incorrect (incompatible with graceful shutdown) test.

Thank you.

mabrarov avatar Jul 01 '25 16:07 mabrarov

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] avatar Oct 25 '25 02:10 github-actions[bot]