tests/gtest_raft_rpunit: monitor_test_fixture to honour leadership changes

Open bashtanov opened this issue 1 year ago • 4 comments

  1. added a utility function to raft_fixture to execute a testing coro in retry_with_leader
  2. made both monitor_test_fixture tests use it

I failed to reproduce the original problem but from the logs it's clear leadership was lost.
https://redpandadata.atlassian.net/browse/CORE-7666
https://buildkite.com/redpanda/redpanda/builds/54755#01920b02-453f-449d-a6fa-ccfa717ca67d/6-82553

I added the facility to raft_fixture because I'm looking to reuse it in other tests.

The testing coro returns a future<>, so it has to indicate leadership problems via an exception. Returning future<> is what allows it to use macros like ASSERT_NE_CORO.

The testing coro has to be in a separate subclass. We could clone the TEST_P_CORO macro to create a class with two coroutines: the inner one that may throw, and the outer one that calls retry_with_leader with the inner. But that would mean maintaining both macros. A macro that creates a fixture subclass for each test won't work because GTest wants the parameter-related code specifically in the most-derived subclass.
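
As a rough illustration of the idea (not the actual raft_fixture code), the retry-on-exception pattern can be sketched in plain, synchronous C++. Everything here is hypothetical: the leadership_lost exception, the retry_on_leadership_loss helper and its attempt budget are stand-ins, and the real helper works with Seastar futures and coroutines rather than a blocking loop.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical exception a test body throws when it notices that
// leadership changed under it. The name is illustrative only.
struct leadership_lost : std::runtime_error {
    leadership_lost() : std::runtime_error("leadership lost") {}
};

// Simplified, synchronous stand-in for a retry_with_leader-style helper:
// re-run the body until it completes without reporting a leadership
// change, or until the attempt budget is exhausted.
// Returns the number of attempts that were needed.
inline int retry_on_leadership_loss(
  int max_attempts, const std::function<void(int)>& body) {
    assert(max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        try {
            body(attempt);
            return attempt; // body finished while leadership was stable
        } catch (const leadership_lost&) {
            // Leadership moved mid-test: loop around and retry.
        }
    }
    throw std::runtime_error("no stable leader within the attempt budget");
}

// Usage: the body loses leadership on the first two attempts and
// succeeds on the third.
inline int demo_attempts_needed() {
    return retry_on_leadership_loss(5, [](int attempt) {
        if (attempt < 3) {
            throw leadership_lost{};
        }
        // Assertions about replicated state would go here.
    });
}
```

Only the leadership exception triggers a retry in this sketch; any other exception propagates to the caller, so genuine test failures are not masked by the retry loop.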

Backports Required

  • [ ] none - not a bug fix
  • [ ] none - this is a backport
  • [ ] none - issue does not exist in previous branches
  • [x] none - papercut/not impactful enough to backport
  • [ ] v24.2.x
  • [ ] v24.1.x
  • [ ] v23.3.x

Release Notes

  • none

bashtanov avatar Sep 20 '24 08:09 bashtanov

new failures in https://buildkite.com/redpanda/redpanda/builds/54812#01920ed0-9429-46f8-84ab-7b767336cd44:

"rptest.tests.data_migrations_api_test.DataMigrationsApiTest.test_higher_level_migration_api"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58900#01936f17-106f-4443-83a3-5876f951fd57:

"rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.SegmentRolledByTimeout==True"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58931#0193709b-0a35-4256-b675-688b0eabc150:

"rptest.tests.maintenance_test.MaintenanceTest.test_maintenance_sticky.use_rpk=True"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58944#01937247-7ec7-4922-bc53-fe790e46ef8d:

"rptest.tests.consumer_group_balancing_test.ConsumerGroupBalancingTest.test_coordinator_nodes_balance"

vbotbuildovich avatar Sep 20 '24 10:09 vbotbuildovich

failure unrelated, to be fixed in https://github.com/redpanda-data/redpanda/pull/23335 (test too strict)

bashtanov avatar Sep 20 '24 11:09 bashtanov

@bashtanov I think the latest failures are related to this change.

bharathv avatar Sep 23 '24 21:09 bharathv

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/56546#01929185-9b73-462d-a4c8-f7b85ee86b90
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/56546#01929185-9b6e-45b0-93f5-45932dda778a
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/57532#0192f78f-5743-4af6-9f23-9adb50e261af
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/57532#0192f794-8344-438e-aeb3-bf45154a39ec
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/57928#01931d21-0aba-40b4-9fd6-2f70ba71bbee
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58110#019330ae-17a0-46b5-87ca-a64663cde501
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58257#019344cd-3718-4117-84fc-bc4d3fbbbf2a
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58385#01934b3f-bde7-4351-af08-a2374fa6a24a
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58557#0193544a-12a4-432b-9a63-49305b97e97f
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58557#0193544e-823c-4438-a9cb-b91afe64ef0c
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58557#0193544e-823a-4661-a874-257f03bb0d44
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58931#019370b6-7c6b-4bf0-81e0-aef78f48f16d

vbotbuildovich avatar Oct 15 '24 20:10 vbotbuildovich

the below tests from https://buildkite.com/redpanda/redpanda/builds/57532#0192f74e-2e3a-4444-9967-1233ed17dfae have failed and will be retried

storage_e2e_single_thread_rpunit

the below tests from https://buildkite.com/redpanda/redpanda/builds/57532#0192f74e-2e37-45d3-b9ec-28b0d4c03a48 have failed and will be retried

partition_balancer_planner_test_rpunit

the below tests from https://buildkite.com/redpanda/redpanda/builds/57617#0192fd54-f8a4-4376-875b-02300660830c have failed and will be retried

gtest_raft_rpunit
datalake_cloud_rpunit

the below tests from https://buildkite.com/redpanda/redpanda/builds/58900#01936eb8-fd53-4046-bb5e-bf789f4f9e3a have failed and will be retried

gtest_raft_rpunit

vbotbuildovich avatar Nov 04 '24 14:11 vbotbuildovich

@ztlpn I've added changes required for all raft tests to build using bazel, could you have another look please?

bashtanov avatar Nov 06 '24 09:11 bashtanov

@rockwotj maybe could you review the bazelization (everything but the first commit, which has been already reviewed by @ztlpn)?

bashtanov avatar Nov 11 '24 17:11 bashtanov

Happy to review the Bazel changes

rockwotj avatar Nov 11 '24 17:11 rockwotj

@rockwotj some of them are flaky indeed. How do you figure out they are not in cmake? (I thought there were some issues with these in cmake as well)

bashtanov avatar Nov 15 '24 15:11 bashtanov

How do you figure out they are not in cmake?

I don't know if we have Jira tickets for them but that's the only way I can think of besides just running the same tests multiple times using the cmake build and making sure the set of flaky tests is the same.

rockwotj avatar Nov 15 '24 16:11 rockwotj

Could you post the list of flaky tests here for posterity?

rockwotj avatar Nov 15 '24 16:11 rockwotj

Sure.

  1. I've had a seastar assert in raft_reconfiguration_test in reconfiguration_test.configuration_replace_test.
  2. Also, this test sometimes takes a really long time, including longer than the bazel limit. Since it is classified as eternal, I thought it should time out after 3600s, but right now I have 4000+s and 6000+s instances. Any idea how come?
  3. I also got some Failed to allocate bytes errors in basic_raft_fixture_test and raft_reconfiguration_test, I presume these might be memory fragmentation problems on my machine?

bashtanov avatar Nov 15 '24 17:11 bashtanov

@dotnwat what is the command that fails with this error? What kind of bisectability does it break?

bashtanov avatar Nov 19 '24 14:11 bashtanov

@dotnwat what is the command that fails with this error?

@bashtanov checkout the commit i referenced, run bazel build //...

What kind of bisectability does it break

the bisectability of the tree. generally we want to strive for every commit to build.

dotnwat avatar Nov 19 '24 17:11 dotnwat

Thanks @dotnwat, I have rearranged the commits so that it builds now

bashtanov avatar Nov 20 '24 18:11 bashtanov

It looks like this commit

commit 0ed25bdbd5fa20d72976e9d8b8d03b4c20abc864 (HEAD)
Author: Alexey Bashtanov
Date:   Fri Oct 25 14:02:10 2024 +0100

    r/tests: add bazel targets for tests

Doesn't build

ERROR: /home/nwatkins/src/redpanda-bisect-check/redpanda/src/v/datalake/coordinator/tests/BUILD:94:18: Compiling src/v/datalake/coordinator/tests/state_machine_test.cc failed: (Exit 1): cc_wrapper.sh failed: error executing CppCompile command (from target //src/v/datalake/coordinator/tests:state_machine_test) external/toolchains_llvm~~llvm~llvm_18_toolchain/bin/cc_wrapper.sh -U_FORTIFY_SOURCE '--target=x86_64-unknown-linux-gnu' -U_FORTIFY_SOURCE -fstack-protector -fno-omit-frame-pointer -fcolor-diagnostics ... (remaining 830 arguments skipped)

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
src/v/datalake/coordinator/tests/state_machine_test.cc:180:5: error: use of undeclared identifier 'RPTEST_REQUIRE_EVENTUALLY_CORO'
  180 |     RPTEST_REQUIRE_EVENTUALLY_CORO(5s, [this]() {
      |     ^
1 error generated.

dotnwat avatar Nov 21 '24 02:11 dotnwat

yes, but it's not part of the PR anymore, now it builds

bashtanov avatar Nov 21 '24 08:11 bashtanov

/dt

bashtanov avatar Nov 21 '24 11:11 bashtanov

known failures are https://redpandadata.atlassian.net/browse/CORE-8318 and https://redpandadata.atlassian.net/browse/CORE-8319

bashtanov avatar Nov 21 '24 12:11 bashtanov

@bashtanov the bazel build is failing:

/root/.cache/bazel/_bazel_root/7aff2a0765678c91461b80a918fc7d3a/sandbox/linux-sandbox/1964/execroot/_main/src/v/datalake/coordinator/tests/state_machine_test.cc:180:5: error: use of undeclared identifier 'RPTEST_REQUIRE_EVENTUALLY_CORO' [clang-diagnostic-error]
  180 |     RPTEST_REQUIRE_EVENTUALLY_CORO(5s, [this]() {
      |     ^

rockwotj avatar Nov 21 '24 15:11 rockwotj

Thanks both, this problem only reproduced after I rebased on top of latest dev. I've tested that every single commit builds locally, so :crossed_fingers: it works this time.

bashtanov avatar Nov 21 '24 17:11 bashtanov

this failure is caused by a gtest linking in the boost test libraries, probably indirectly through a fixture.

FAIL: //src/v/datalake/coordinator/tests:state_machine_test (run 1 of 3) (see /root/.cache/bazel/_bazel_root/4219624bc2a11e063c576490f0756711/execroot/_main/bazel-out/k8-fastbuild/testlogs/src/v/datalake/coordinator/tests/state_machine_test/run_1_of_3/test.log)
--
  | INFO: From Testing //src/v/datalake/coordinator/tests:state_machine_test (run 1 of 3):
  | ==================== Test output for //src/v/datalake/coordinator/tests:state_machine_test (run 1 of 3):
  | An unrecognized parameter in the argument blocked-reactor-notify-ms
  |  
  |  
  | The program 'state_machine_test' is a Boost.Test module containing unit tests.
  |  
  | Usage
  | state_machine_test [Boost.Test argument]... [-- [custom test module argument]...]
  |  
  | Use
  | state_machine_test --help
  | or  state_machine_test --help=<parameter name>
  | for detailed help on Boost.Test parameters.
  | ================================================================================
  | [14,904 / 15,048] 376 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 159s linux-sandbox ... (48 actions, 23 running)
  | [14,907 / 15,048] 376 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 161s linux-sandbox ... (48 actions, 22 running)
  | [14,908 / 15,048] 376 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 162s linux-sandbox ... (48 actions, 23 running)
  | [14,910 / 15,048] 377 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 163s linux-sandbox ... (48 actions, 24 running)
  | [14,910 / 15,048] 377 / 474 tests; Testing //src/v/wasm/tests:wasm_transform_test (run 3 of 3); 164s linux-sandbox ... (48 actions, 27 running)
  | FAIL: //src/v/datalake/coordinator/tests:state_machine_test (run 2 of 3) (see /root/.cache/bazel/_bazel_root/4219624bc2a11e063c576490f0756711/execroot/_main/bazel-out/k8-fastbuild/testlogs/src/v/datalake/coordinator/tests/state_machine_test/run_2_of_3/test.log)
  | INFO: From Testing //src/v/datalake/coordinator/tests:state_machine_test (run 2 of 3):
  | ==================== Test output for //src/v/datalake/coordinator/tests:state_machine_test (run 2 of 3):
  | An unrecognized parameter in the argument blocked-reactor-notify-ms
  |  
  |  
  | The program 'state_machine_test' is a Boost.Test module containing unit tests.
  |  
  | Usage
  | state_machine_test [Boost.Test argument]... [-- [custom test module argument]...]
  |  
  | Use
  | state_machine_test --help
  | or  state_machine_test --help=<parameter name>
  | for detailed help on Boost.Test parameters.
  | ================================================================================

dotnwat avatar Nov 21 '24 18:11 dotnwat

Seems I've untangled it. @bharathv in https://github.com/redpanda-data/redpanda/pull/23398/commits/bdda073efd4f435c151875e71e947f8d7fa0e8ea I removed a bazel target you created, so I would appreciate if you could make sure it's okay

bashtanov avatar Nov 22 '24 12:11 bashtanov

could someone have a look at the bazel stuff please? it passed now

bashtanov avatar Nov 25 '24 17:11 bashtanov

yeah Noah bazelized the same tests, I'll need to rebase to see what value is left in this PR

bashtanov avatar Nov 26 '24 09:11 bashtanov

I've changed this PR back to only fixing what is in the description. Bazelization of this and other tests happened in a different PR.

bashtanov avatar Nov 27 '24 15:11 bashtanov

/dt

bashtanov avatar Nov 27 '24 15:11 bashtanov

Retry command for Build#58900

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, SegmentRolledByTimeout == True)"}}

vbotbuildovich avatar Nov 27 '24 21:11 vbotbuildovich

the failure is https://redpandadata.atlassian.net/issues/CORE-7833

bashtanov avatar Nov 28 '24 01:11 bashtanov

/ci-repeat 1 tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, SegmentRolledByTimeout == True)"}}

bashtanov avatar Nov 28 '24 01:11 bashtanov