fluent-bit
fluent-bit copied to clipboard
Unit tests intermittent failures/timeouts
Bug Report
Recently I have noticed flakes on various PR unit tests unrelated to the actual changes, e.g.
- https://github.com/fluent/fluent-bit/actions/runs/15389866212/attempts/1?pr=10418
- https://github.com/fluent/fluent-bit/actions/runs/15408387192/attempts/2?pr=10426
The problem seems to be one of the tests is stuck so eventually the job is cancelled by the 1 hour timeout we have in place.
I have noticed this across various PRs but it does not seem to be within the same combination of options - in the two examples above one is for SIMD=off and the other is using the memory sanitiser.
@pwhelan is there a way we can see which one is stuck/still running in CI? We can trigger that then on cancellation potentially.
This test got cancelled for unknown reasons: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
154/157 Test #157: flb-it-lua ....................................... Passed 0.01 sec
155/157 Test #144: flb-it-stream_processor .......................... Passed 17.62 sec
156/157 Test #71: flb-rt-out_stackdriver ........................... Passed 300.00 sec
Error: The operation was canceled.
The last test must have timed out. I managed to find that the test flb-rt-processor_metrics_selector did not run.
I was able to find the test using the following snippet in fish on a log full of the test results:
for test in $(grep "Start " tests.txt | awk -F ' ' '{print $3}'); do
if ! grep $test tests.txt | grep "Passed\|Failed" 1> /dev/null; then
echo "TEST TIMEOUTD: $test"
fi
done
In another run I see it took 16 seconds:
Start 78: flb-rt-processor_metrics_selector
75/157 Test #75: flb-rt-out_file .................................. Passed 26.00 sec
76/157 Test #77: flb-rt-out_influxdb .............................. Passed 18.00 sec
Start 79: flb-rt-processor_labels
Start 80: flb-rt-processor_content_modifier
77/157 Test #79: flb-rt-processor_labels .......................... Passed 10.00 sec
Start 81: custom_calyptia.sh
78/157 Test #78: flb-rt-processor_metrics_selector ................ Passed 16.00 sec
That might have been just enough to push it outside of the time frame for the test, but in any case if we were to fix this I think tackling flb-rt-out_stackdriver is the best bet. It takes way too long because it has too many test cases which all run serially.
I thought normally they don't seem to hit the timeout so that's why I thought it was stuck somewhere. Eventually we do timeout but my thinking was we were just stuck, it doesn't seem consistent either - was it the same test in both?
It is actually flb-rt-out_stackdriver just taking a long time. It has about ~100 tests which all must have calls to sleep, etc... so it will take at least about 1 and a half minutes to execute at the moment.
We could rewrite all the tests as a single test and parallelize it from there, or group several tests into one.
I am working on a PoC which can serialize acutests but it is not even close to done yet.
Hi,
I noticed that tests fail when executed in parallel, e.g. using:
ctest -j 32
My CMake configuration:
-G "Unix Makefiles"
-DCMAKE_BUILD_TYPE=Debug
-DFLB_TESTS_INTERNAL=On
-DFLB_TESTS_RUNTIME=On
-DFLB_WITHOUT_flb-rt-out_td=1
-DFLB_WITHOUT_flb-rt-out_tcp=1
-DFLB_WITHOUT_flb-rt-in_tcp=1
-DFLB_JEMALLOC=On
-DFLB_DEBUG=On
It looks like some tests run infinitely and then respective job is cancelled due to timeout. If I look at the links provided by @patrick-stephens in description of this issue and then navigate to the failed (cancelled due to timeout) job like https://github.com/fluent/fluent-bit/actions/runs/15124987740/job/42515298979, then I can see these infinitely running tests under "Complete job" step:
Cleaning up orphan processes
Terminate orphan process: pid (2959) (run-unit-tests.sh)
Terminate orphan process: pid (17155) (ctest)
Terminate orphan process: pid (19310) (flb-rt-processor_labels)
Terminate orphan process: pid (19318) (flb-rt-processor_labels)
The oldest commit where I was able to reproduce such behavior of tests is 5798ea426d91005b01df5147ac4f239079db365c and the latest commit where I was not able to reproduce this issue is cf553f782f5c0d67ece9ad53cdec844d07cbf7bf, so I suspect this issue started from (was introduced by) https://github.com/fluent/fluent-bit/pull/9952 (refer to failed tests at CI for this PR - https://github.com/fluent/fluent-bit/actions/runs/15124987740/job/42515298950).
Screenshots of debugger attached to infinitely running test:
Thank you.
I have not had the time to checkout the impact of #9952. It could possibly be exasperating things though by allowing tests to run longer due to the grace period.
Hi @pwhelan,
It seems that https://github.com/fluent/fluent-bit/pull/9952 makes some tests running not longer, but infinitely. Due to this issue happens only when tests are executed in parallel, I suspect some race condition (in engine) and/or incorrect (incompatible with graceful shutdown) test.
Thank you.
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.