tests/gtest_raft_rpunit: monitor_test_fixture to honour leadership changes
- added a utility function to raft_fixture to execute a testing coro in retry_with_leader
- made both monitor_test_fixture tests use it
I failed to reproduce the original problem, but from the logs it's clear leadership was lost.
https://redpandadata.atlassian.net/browse/CORE-7666
https://buildkite.com/redpanda/redpanda/builds/54755#01920b02-453f-449d-a6fa-ccfa717ca67d/6-82553
I added the facility to raft_fixture because I'm looking to reuse it in other tests.
The testing coro returns a future<>, so it has to indicate leadership problems via an exception. That's so that it can use macros like ASSERT_NE_CORO.
The testing coro has to be in a separate subclass. We could clone the TEST_P_CORO macro to create a class with two coroutines: the inner one that may throw, and the outer one that calls retry_with_leader with the inner. But that would mean maintaining both macros. A macro to create a fixture subclass for each test won't work because Gtest wants parameter-related code specifically in the most-derived subclass.
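To illustrate the retry-on-exception idea, here is a minimal synchronous sketch. It is not the raft_fixture code: the real helper works with Seastar futures and coroutines, while this analogue uses plain C++ only. All names in it (`fake_cluster`, `not_leader_error`, `retry_with_leader_sketch`, `run_example`) are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical exception: the test body throws it when the node it was
// talking to turns out not to be the leader any more.
struct not_leader_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for a 3-node raft group.
struct fake_cluster {
    int leader = 0;
    int elections_left = 0; // leadership changes still to be simulated

    // Simulates an operation that only succeeds on the current leader.
    // While elections remain, leadership moves before the check.
    void replicate_on(int node) {
        if (elections_left > 0) {
            --elections_left;
            leader = (leader + 1) % 3;
        }
        if (node != leader) {
            throw not_leader_error("leadership lost");
        }
    }
};

// Simplified analogue of the retry idea: re-run the body against the
// current leader until it succeeds or the attempts run out.
// Returns the number of attempts that were needed.
template <typename Body>
int retry_with_leader_sketch(fake_cluster& c, int max_attempts, Body body) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        try {
            body(c.leader);
            return attempt;
        } catch (const not_leader_error&) {
            if (attempt == max_attempts) {
                throw; // give up and surface the leadership problem
            }
        }
    }
    throw std::logic_error("max_attempts must be positive");
}

// The body reports leadership loss only by throwing, so it stays free
// to use assertion-style checks for everything else.
int run_example() {
    fake_cluster c;
    c.elections_left = 2;
    int body_runs = 0;
    int attempts = retry_with_leader_sketch(c, 5, [&](int leader) {
        ++body_runs;
        c.replicate_on(leader);
    });
    assert(attempts == body_runs);
    return attempts;
}
```

With two simulated leadership changes, the body runs three times before it completes on a stable leader. If leadership keeps moving, the last `not_leader_error` propagates to the caller.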
Backports Required
- [ ] none - not a bug fix
- [ ] none - this is a backport
- [ ] none - issue does not exist in previous branches
- [x] none - papercut/not impactful enough to backport
- [ ] v24.2.x
- [ ] v24.1.x
- [ ] v23.3.x
Release Notes
- none
new failures in https://buildkite.com/redpanda/redpanda/builds/54812#01920ed0-9429-46f8-84ab-7b767336cd44:
"rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_higher_level_migration_api"
non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58900#01936f17-106f-4443-83a3-5876f951fd57:
"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.SegmentRolledByTimeout==True"
non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58931#0193709b-0a35-4256-b675-688b0eabc150:
"rptest.tests.maintenance_test.MaintenanceTest.test_maintenance_sticky.use_rpk=True"
non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58944#01937247-7ec7-4922-bc53-fe790e46ef8d:
"rptest.tests.consumer_group_balancing_test.ConsumerGroupBalancingTest.test_coordinator_nodes_balance"
failure unrelated, to be fixed in https://github.com/redpanda-data/redpanda/pull/23335 (test too strict)
@bashtanov I think the latest failures are related to this change.
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/56546#01929185-9b73-462d-a4c8-f7b85ee86b90
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/56546#01929185-9b6e-45b0-93f5-45932dda778a
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/57532#0192f78f-5743-4af6-9f23-9adb50e261af
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/57532#0192f794-8344-438e-aeb3-bf45154a39ec
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/57928#01931d21-0aba-40b4-9fd6-2f70ba71bbee
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58110#019330ae-17a0-46b5-87ca-a64663cde501
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58257#019344cd-3718-4117-84fc-bc4d3fbbbf2a
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58385#01934b3f-bde7-4351-af08-a2374fa6a24a
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58557#0193544a-12a4-432b-9a63-49305b97e97f
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58557#0193544e-823c-4438-a9cb-b91afe64ef0c
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58557#0193544e-823a-4661-a874-257f03bb0d44
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58931#019370b6-7c6b-4bf0-81e0-aef78f48f16d
the below tests from https://buildkite.com/redpanda/redpanda/builds/57532#0192f74e-2e3a-4444-9967-1233ed17dfae have failed and will be retried
storage_e2e_single_thread_rpunit
the below tests from https://buildkite.com/redpanda/redpanda/builds/57532#0192f74e-2e37-45d3-b9ec-28b0d4c03a48 have failed and will be retried
partition_balancer_planner_test_rpunit
the below tests from https://buildkite.com/redpanda/redpanda/builds/57617#0192fd54-f8a4-4376-875b-02300660830c have failed and will be retried
gtest_raft_rpunit
datalake_cloud_rpunit
the below tests from https://buildkite.com/redpanda/redpanda/builds/58900#01936eb8-fd53-4046-bb5e-bf789f4f9e3a have failed and will be retried
gtest_raft_rpunit
@ztlpn I've added changes required for all raft tests to build using bazel, could you have another look please?
@rockwotj could you maybe review the bazelization (everything but the first commit, which has already been reviewed by @ztlpn)?
Happy to review the Bazel changes
@rockwotj some of them are flaky indeed. How do you figure out they are not in cmake? (I thought there were some issues with these in cmake as well)
> How do you figure out they are not in cmake?
I don't know if we have Jira tickets for them but that's the only way I can think of besides just running the same tests multiple times using the cmake build and making sure the set of flaky tests is the same.
Could you post the list of flaky tests here for posterity?
Sure.
- I've had a seastar assert in `raft_reconfiguration_test` in `reconfiguration_test.configuration_replace_test`.
- Also, this test sometimes takes a really long time, including more than the bazel limit. (Classified as `eternal`, it should time out after 3600s I thought, but right now I have 4000+s and 6000+s instances, any idea how come?)
- I also got some `Failed to allocate bytes` errors in `basic_raft_fixture_test` and `raft_reconfiguration_test`. I presume these might be memory fragmentation problems on my machine?
@dotnwat what is the command that fails with this error? What kind of bisectability does it break?
> @dotnwat what is the command that fails with this error?
@bashtanov checkout the commit i referenced, run bazel build //...
> What kind of bisectability does it break
the bisectability of the tree. generally we want to strive for every commit to build.
Thanks @dotnwat, I have rearranged the commits so that it builds now
It looks like this commit
commit 0ed25bdbd5fa20d72976e9d8b8d03b4c20abc864 (HEAD)
Author: Alexey Bashtanov <[email protected]>
Date: Fri Oct 25 14:02:10 2024 +0100
r/tests: add bazel targets for tests
Doesn't build
ERROR: /home/nwatkins/src/redpanda-bisect-check/redpanda/src/v/datalake/coordinator/tests/BUILD:94:18: Compiling src/v/datalake/coordinator/tests/state_machine_test.cc failed: (Exit 1): cc_wrapper.sh failed: error executing CppCompile command (from target //src/v/datalake/coordinator/tests:state_machine_test) external/toolchains_llvm~~llvm~llvm_18_toolchain/bin/cc_wrapper.sh -U_FORTIFY_SOURCE '--target=x86_64-unknown-linux-gnu' -U_FORTIFY_SOURCE -fstack-protector -fno-omit-frame-pointer -fcolor-diagnostics ... (remaining 830 arguments skipped)
Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
src/v/datalake/coordinator/tests/state_machine_test.cc:180:5: error: use of undeclared identifier 'RPTEST_REQUIRE_EVENTUALLY_CORO'
180 | RPTEST_REQUIRE_EVENTUALLY_CORO(5s, [this]() {
| ^
1 error generated.
yes, but it's not part of the PR anymore, now it builds
/dt
known failures are https://redpandadata.atlassian.net/browse/CORE-8318 and https://redpandadata.atlassian.net/browse/CORE-8319
@bashtanov the bazel build is failing:
/root/.cache/bazel/_bazel_root/7aff2a0765678c91461b80a918fc7d3a/sandbox/linux-sandbox/1964/execroot/_main/src/v/datalake/coordinator/tests/state_machine_test.cc:180:5: error: use of undeclared identifier 'RPTEST_REQUIRE_EVENTUALLY_CORO' [clang-diagnostic-error]
180 | RPTEST_REQUIRE_EVENTUALLY_CORO(5s, [this]() {
| ^
Thanks both, this problem only reproduced after I rebased on top of latest dev. I've tested that every single commit builds locally, so :crossed_fingers: it works this time.
this failure is caused by a gtest linking in the boost test libraries, probably indirectly through a fixture.
FAIL: //src/v/datalake/coordinator/tests:state_machine_test (run 1 of 3) (see /root/.cache/bazel/_bazel_root/4219624bc2a11e063c576490f0756711/execroot/_main/bazel-out/k8-fastbuild/testlogs/src/v/datalake/coordinator/tests/state_machine_test/run_1_of_3/test.log)
--
| INFO: From Testing //src/v/datalake/coordinator/tests:state_machine_test (run 1 of 3):
| ==================== Test output for //src/v/datalake/coordinator/tests:state_machine_test (run 1 of 3):
| An unrecognized parameter in the argument blocked-reactor-notify-ms
|
|
| The program 'state_machine_test' is a Boost.Test module containing unit tests.
|
| Usage
| state_machine_test [Boost.Test argument]... [-- [custom test module argument]...]
|
| Use
| state_machine_test --help
| or state_machine_test --help=<parameter name>
| for detailed help on Boost.Test parameters.
| ================================================================================
| [14,904 / 15,048] 376 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 159s linux-sandbox ... (48 actions, 23 running)
| [14,907 / 15,048] 376 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 161s linux-sandbox ... (48 actions, 22 running)
| [14,908 / 15,048] 376 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 162s linux-sandbox ... (48 actions, 23 running)
| [14,910 / 15,048] 377 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 163s linux-sandbox ... (48 actions, 24 running)
| [14,910 / 15,048] 377 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 164s linux-sandbox ... (48 actions, 27 running)
| FAIL: //src/v/datalake/coordinator/tests:state_machine_test (run 2 of 3) (see /root/.cache/bazel/_bazel_root/4219624bc2a11e063c576490f0756711/execroot/_main/bazel-out/k8-fastbuild/testlogs/src/v/datalake/coordinator/tests/state_machine_test/run_2_of_3/test.log)
| INFO: From Testing //src/v/datalake/coordinator/tests:state_machine_test (run 2 of 3):
| ==================== Test output for //src/v/datalake/coordinator/tests:state_machine_test (run 2 of 3):
| An unrecognized parameter in the argument blocked-reactor-notify-ms
|
|
| The program 'state_machine_test' is a Boost.Test module containing unit tests.
|
| Usage
| state_machine_test [Boost.Test argument]... [-- [custom test module argument]...]
|
| Use
| state_machine_test --help
| or state_machine_test --help=<parameter name>
| for detailed help on Boost.Test parameters.
| ================================================================================
Seems I've untangled it. @bharathv in https://github.com/redpanda-data/redpanda/pull/23398/commits/bdda073efd4f435c151875e71e947f8d7fa0e8ea I removed a bazel target you created, so I would appreciate it if you could make sure it's okay
could someone have a look at the bazel stuff please? it passed now
yeah Noah bazelized the same tests, I'll need to rebase to see what value is left in this PR
I've changed this PR back to only fixing what is in the description. Bazelization of this and other tests happened in a different PR.
/dt
Retry command for Build#58900
please wait until all jobs are finished before running the slash command
/ci-repeat 1
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, SegmentRolledByTimeout == True)"}}
the failure is https://redpandadata.atlassian.net/issues/CORE-7833
/ci-repeat 1 tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, SegmentRolledByTimeout == True)"}}