redpanda tests/gtest_raft_rpunit: monitor_test_fixture to honour leadership changes

added a utility function to raft_fixture to execute a testing coro in retry_with_leader
made both monitor_test_fixture tests use it

I failed to reproduce the original problem, but from the logs it's clear leadership was lost:
https://redpandadata.atlassian.net/browse/CORE-7666
https://buildkite.com/redpanda/redpanda/builds/54755#01920b02-453f-449d-a6fa-ccfa717ca67d/6-82553

I added the facility to raft_fixture because I'm looking to reuse it in other tests.

The testing coro returns a future<>, so it has to indicate leadership problems via an exception. That is so that macros like ASSERT_NE_CORO can still be used inside it.

The testing coro has to be in a separate subclass. We could clone the TEST_P_CORO macro to create a class with two coroutines: an inner one that may throw, and an outer one that calls retry_with_leader with the inner one. But that would mean maintaining both macros. A macro that creates a fixture subclass for each test won't work either, because GTest wants the parameter-related code specifically in the most-derived subclass.
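To illustrate the retry idea described above, here is a minimal synchronous sketch in plain C++. It is not the code from this PR: the real helper is a Seastar coroutine on raft_fixture, while `leadership_lost`, `retry_on_leadership_loss` and `demo_attempts` are hypothetical names made up for this example. It only shows the control flow: the test body signals a leadership problem by throwing, and the wrapper reruns the body until it succeeds or a deadline passes.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical exception type: thrown by the test body when it notices
// that the node it was talking to is no longer the leader.
struct leadership_lost : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical synchronous stand-in for a retry-with-leader helper.
// Runs `body` until it returns without throwing leadership_lost, or
// until the deadline passes. Any other exception (for example a failed
// assertion) is not caught here and propagates to the caller at once.
inline void retry_on_leadership_loss(
  std::chrono::milliseconds timeout,
  std::chrono::milliseconds backoff,
  const std::function<void()>& body) {
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + timeout;
    for (;;) {
        try {
            body();
            return;
        } catch (const leadership_lost&) {
            if (clock::now() + backoff >= deadline) {
                throw; // out of time: surface the leadership error
            }
            std::this_thread::sleep_for(backoff);
        }
    }
}

// Usage example: the body fails twice because of simulated leadership
// changes and succeeds on the third attempt.
inline int demo_attempts() {
    int attempts = 0;
    retry_on_leadership_loss(
      std::chrono::milliseconds(1000),
      std::chrono::milliseconds(1),
      [&attempts] {
          if (++attempts < 3) {
              throw leadership_lost("not leader");
          }
      });
    assert(attempts == 3);
    return attempts;
}
```

Because only the leadership exception is retried, an ordinary test failure still fails the test on the first attempt instead of being masked by the retry loop.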

Backports Required

[ ] none - not a bug fix
[ ] none - this is a backport
[ ] none - issue does not exist in previous branches
[x] none - papercut/not impactful enough to backport
[ ] v24.2.x
[ ] v24.1.x
[ ] v23.3.x

Release Notes

none

Sep 20 '24 08:09 bashtanov

new failures in https://buildkite.com/redpanda/redpanda/builds/54812#01920ed0-9429-46f8-84ab-7b767336cd44:

"rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_higher_level_migration_api"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58900#01936f17-106f-4443-83a3-5876f951fd57:

"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.SegmentRolledByTimeout==True"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58931#0193709b-0a35-4256-b675-688b0eabc150:

"rptest.tests.maintenance_test.MaintenanceTest.test_maintenance_sticky.use_rpk=True"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58944#01937247-7ec7-4922-bc53-fe790e46ef8d:

"rptest.tests.consumer_group_balancing_test.ConsumerGroupBalancingTest.test_coordinator_nodes_balance"

Sep 20 '24 10:09 vbotbuildovich

failure unrelated, to be fixed in https://github.com/redpanda-data/redpanda/pull/23335 (test too strict)

Sep 20 '24 11:09 bashtanov

@bashtanov I think the latest failures are related to this change.

Sep 23 '24 21:09 bharathv

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/56546#01929185-9b73-462d-a4c8-f7b85ee86b90
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/56546#01929185-9b6e-45b0-93f5-45932dda778a
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/57532#0192f78f-5743-4af6-9f23-9adb50e261af
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/57532#0192f794-8344-438e-aeb3-bf45154a39ec
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/57928#01931d21-0aba-40b4-9fd6-2f70ba71bbee
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58110#019330ae-17a0-46b5-87ca-a64663cde501
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58257#019344cd-3718-4117-84fc-bc4d3fbbbf2a
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58385#01934b3f-bde7-4351-af08-a2374fa6a24a
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58557#0193544a-12a4-432b-9a63-49305b97e97f
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58557#0193544e-823c-4438-a9cb-b91afe64ef0c
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58557#0193544e-823a-4661-a874-257f03bb0d44
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58931#019370b6-7c6b-4bf0-81e0-aef78f48f16d

Oct 15 '24 20:10 vbotbuildovich

the below tests from https://buildkite.com/redpanda/redpanda/builds/57532#0192f74e-2e3a-4444-9967-1233ed17dfae have failed and will be retried

storage_e2e_single_thread_rpunit

the below tests from https://buildkite.com/redpanda/redpanda/builds/57532#0192f74e-2e37-45d3-b9ec-28b0d4c03a48 have failed and will be retried

partition_balancer_planner_test_rpunit

the below tests from https://buildkite.com/redpanda/redpanda/builds/57617#0192fd54-f8a4-4376-875b-02300660830c have failed and will be retried

gtest_raft_rpunit
datalake_cloud_rpunit

the below tests from https://buildkite.com/redpanda/redpanda/builds/58900#01936eb8-fd53-4046-bb5e-bf789f4f9e3a have failed and will be retried

gtest_raft_rpunit

Nov 04 '24 14:11 vbotbuildovich

@ztlpn I've added changes required for all raft tests to build using bazel, could you have another look please?

Nov 06 '24 09:11 bashtanov

@rockwotj could you maybe review the bazelization (everything but the first commit, which has already been reviewed by @ztlpn)?

Nov 11 '24 17:11 bashtanov

Happy to review the Bazel changes

Nov 11 '24 17:11 rockwotj

@rockwotj some of them are flaky indeed. How do you figure out they are not in cmake? (I thought there were some issues with these in cmake as well)

Nov 15 '24 15:11 bashtanov

How do you figure out they are not in cmake?

I don't know if we have Jira tickets for them but that's the only way I can think of besides just running the same tests multiple times using the cmake build and making sure the set of flaky tests is the same.

Nov 15 '24 16:11 rockwotj

Could you post the list of flaky tests here for posterity?

Nov 15 '24 16:11 rockwotj

Sure.

I've had a seastar assert in raft_reconfiguration_test in reconfiguration_test.configuration_replace_test.
Also, this test sometimes takes a really long time, including longer than the Bazel limit. Since it is classified as eternal, I thought it should time out after 3600s, but right now I have instances at 4000+s and 6000+s. Any idea how come?
I also got some Failed to allocate bytes errors in basic_raft_fixture_test and raft_reconfiguration_test, I presume these might be memory fragmentation problems on my machine?

Nov 15 '24 17:11 bashtanov

@dotnwat what is the command that fails with this error? What kind of bisectability does it break?

Nov 19 '24 14:11 bashtanov

@dotnwat what is the command that fails with this error?

@bashtanov checkout the commit i referenced, run bazel build //...

What kind of bisectability does it break

the bisectability of the tree. generally we want to strive for every commit to build.

Nov 19 '24 17:11 dotnwat

Thanks @dotnwat, I have rearranged the commits so that it builds now

Nov 20 '24 18:11 bashtanov

It looks like this commit

commit 0ed25bdbd5fa20d72976e9d8b8d03b4c20abc864 (HEAD)
Author: Alexey Bashtanov <[email protected]>
Date:   Fri Oct 25 14:02:10 2024 +0100

    r/tests: add bazel targets for tests

Doesn't build

ERROR: /home/nwatkins/src/redpanda-bisect-check/redpanda/src/v/datalake/coordinator/tests/BUILD:94:18: Compiling src/v/datalake/coordinator/tests/state_machine_test.cc failed: (Exit 1): cc_wrapper.sh failed: error executing CppCompile command (from target //src/v/datalake/coordinator/tests:state_machine_test) external/toolchains_llvm~~llvm~llvm_18_toolchain/bin/cc_wrapper.sh -U_FORTIFY_SOURCE '--target=x86_64-unknown-linux-gnu' -U_FORTIFY_SOURCE -fstack-protector -fno-omit-frame-pointer -fcolor-diagnostics ... (remaining 830 arguments skipped)

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
src/v/datalake/coordinator/tests/state_machine_test.cc:180:5: error: use of undeclared identifier 'RPTEST_REQUIRE_EVENTUALLY_CORO'
  180 |     RPTEST_REQUIRE_EVENTUALLY_CORO(5s, [this]() {
      |     ^
1 error generated.

Nov 21 '24 02:11 dotnwat

yes, but it's not part of the PR anymore, now it builds

Nov 21 '24 08:11 bashtanov

/dt

Nov 21 '24 11:11 bashtanov

known failures are https://redpandadata.atlassian.net/browse/CORE-8318 and https://redpandadata.atlassian.net/browse/CORE-8319

Nov 21 '24 12:11 bashtanov

@bashtanov the bazel build is failing:

/root/.cache/bazel/_bazel_root/7aff2a0765678c91461b80a918fc7d3a/sandbox/linux-sandbox/1964/execroot/_main/src/v/datalake/coordinator/tests/state_machine_test.cc:180:5: error: use of undeclared identifier 'RPTEST_REQUIRE_EVENTUALLY_CORO' [clang-diagnostic-error]
  180 |     RPTEST_REQUIRE_EVENTUALLY_CORO(5s, [this]() {
      |     ^

Nov 21 '24 15:11 rockwotj

Thanks both, this problem only reproduced after I rebased on top of the latest dev. I've tested locally that every single commit builds, so :crossed_fingers: it works this time.

Nov 21 '24 17:11 bashtanov

this failure is caused by a gtest linking in the boost test libraries, probably indirectly through a fixture.

FAIL: //src/v/datalake/coordinator/tests:state_machine_test (run 1 of 3) (see /root/.cache/bazel/_bazel_root/4219624bc2a11e063c576490f0756711/execroot/_main/bazel-out/k8-fastbuild/testlogs/src/v/datalake/coordinator/tests/state_machine_test/run_1_of_3/test.log)
--
  | INFO: From Testing //src/v/datalake/coordinator/tests:state_machine_test (run 1 of 3):
  | ==================== Test output for //src/v/datalake/coordinator/tests:state_machine_test (run 1 of 3):
  | An unrecognized parameter in the argument blocked-reactor-notify-ms
  |  
  |  
  | The program 'state_machine_test' is a Boost.Test module containing unit tests.
  |  
  | Usage
  | state_machine_test [Boost.Test argument]... [-- [custom test module argument]...]
  |  
  | Use
  | state_machine_test --help
  | or  state_machine_test --help=<parameter name>
  | for detailed help on Boost.Test parameters.
  | ================================================================================
  | [14,904 / 15,048] 376 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 159s linux-sandbox ... (48 actions, 23 running)
  | [14,907 / 15,048] 376 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 161s linux-sandbox ... (48 actions, 22 running)
  | [14,908 / 15,048] 376 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 162s linux-sandbox ... (48 actions, 23 running)
  | [14,910 / 15,048] 377 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 163s linux-sandbox ... (48 actions, 24 running)
  | [14,910 / 15,048] 377 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 164s linux-sandbox ... (48 actions, 27 running)
  | FAIL: //src/v/datalake/coordinator/tests:state_machine_test (run 2 of 3) (see /root/.cache/bazel/_bazel_root/4219624bc2a11e063c576490f0756711/execroot/_main/bazel-out/k8-fastbuild/testlogs/src/v/datalake/coordinator/tests/state_machine_test/run_2_of_3/test.log)
  | INFO: From Testing //src/v/datalake/coordinator/tests:state_machine_test (run 2 of 3):
  | ==================== Test output for //src/v/datalake/coordinator/tests:state_machine_test (run 2 of 3):
  | An unrecognized parameter in the argument blocked-reactor-notify-ms
  |  
  |  
  | The program 'state_machine_test' is a Boost.Test module containing unit tests.
  |  
  | Usage
  | state_machine_test [Boost.Test argument]... [-- [custom test module argument]...]
  |  
  | Use
  | state_machine_test --help
  | or  state_machine_test --help=<parameter name>
  | for detailed help on Boost.Test parameters.
  | ================================================================================

Nov 21 '24 18:11 dotnwat

It seems I've untangled it. @bharathv in https://github.com/redpanda-data/redpanda/pull/23398/commits/bdda073efd4f435c151875e71e947f8d7fa0e8ea I removed a bazel target you created, so I would appreciate it if you could make sure that's okay.

Nov 22 '24 12:11 bashtanov

could someone have a look at the bazel stuff please? it passed now

Nov 25 '24 17:11 bashtanov

yeah Noah bazelized the same tests, I'll need to rebase to see what value is left in this PR

Nov 26 '24 09:11 bashtanov

I've changed this PR back to only fixing what is in the description. Bazelization of this and other tests happened in a different PR.

Nov 27 '24 15:11 bashtanov

/dt

Nov 27 '24 15:11 bashtanov

Retry command for Build#58900

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, SegmentRolledByTimeout == True)"}}

Nov 27 '24 21:11 vbotbuildovich

the failure is https://redpandadata.atlassian.net/issues/CORE-7833

Nov 28 '24 01:11 bashtanov

/ci-repeat 1 tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, SegmentRolledByTimeout == True)"}}

Nov 28 '24 01:11 bashtanov