Test BadLogLines failures with uncaught raft::offset_monitor::wait_aborted (FranzGoVerifiableWithSiTest.test_si_with_timeboxed, PartitionBalancerTest.test_fuzz_admin_ops)
rptest.scale_tests.franz_go_verifiable_test.FranzGoVerifiableWithSiTest.test_si_with_timeboxed.segment_size=10485760
<BadLogLines nodes=ip-172-31-58-10(3) example="ERROR 2022-06-14 07:28:06,896 [shard 0] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)">
This is similar to #4489; both cases have the offset monitor wait aborted exception.
Reproduced in CDT here.
Assigning to @ZeDRoman, as #4489 was something he was looking at.
This uncaught exception is still in the code. I'm currently seeing it around the same time as I start a bunch of clients doing idempotent writes.
I have a mixture of caught wait_aborted exceptions coming from the id_allocator machinery, and then some uncaught ones making it up to the RPC handler that's logging these as ERROR:
WARN 2022-08-04 19:44:23,752 [shard 0] cluster - id_allocator_frontend.cc:252 - can not create {kafka_internal}/{id_allocator} topic - error: raft::offset_monitor::wait_aborted (offset monitor wait aborted)
WARN 2022-08-04 19:44:23,752 [shard 0] cluster - id_allocator_frontend.cc:70 - can't find {ns: {kafka_internal}, topic: {id_allocator}} in the metadata cache
WARN 2022-08-04 19:44:23,752 [shard 0] kafka - init_producer_id.cc:114 - failed to allocate pid, ec: cluster::errc:14
ERROR 2022-08-04 19:44:23,772 [shard 1] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)
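To make the caught/uncaught split in the lines above easier to see when triaging a larger log, here is a minimal sketch that separates the two cases. It is not part of rptest or the BadLogLines check; `classify_wait_aborted` and `LOG_RE` are hypothetical names, and the regex only assumes the line layout visible in the four log lines quoted above (level, date, time, shard, logger, message).

```python
import re

# Hypothetical helper, not part of rptest. The pattern assumes the layout of
# the log lines quoted in this issue:
#   LEVEL DATE TIME [shard N] LOGGER - MESSAGE
LOG_RE = re.compile(
    r"^(?P<level>[A-Z]+)\s+\S+\s+\S+\s+\[shard\s+(?P<shard>\d+)\]\s+"
    r"(?P<logger>\S+)\s+-\s+(?P<msg>.*)$"
)

EXCEPTION_NAME = "raft::offset_monitor::wait_aborted"


def classify_wait_aborted(lines):
    """Split wait_aborted log lines into (caught, uncaught).

    A line counts as uncaught when the rpc logger reports it at ERROR level,
    which is the case that trips BadLogLines in this issue. Any other line
    mentioning the exception is treated as caught.
    """
    caught, uncaught = [], []
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m is None or EXCEPTION_NAME not in m["msg"]:
            continue  # unparseable line, or unrelated to wait_aborted
        if m["level"] == "ERROR" and m["logger"] == "rpc":
            uncaught.append(line)
        else:
            caught.append(line)
    return caught, uncaught


# The four log lines from this comment, copied verbatim.
SAMPLE = [
    "WARN 2022-08-04 19:44:23,752 [shard 0] cluster - id_allocator_frontend.cc:252 - can not create {kafka_internal}/{id_allocator} topic - error: raft::offset_monitor::wait_aborted (offset monitor wait aborted)",
    "WARN 2022-08-04 19:44:23,752 [shard 0] cluster - id_allocator_frontend.cc:70 - can't find {ns: {kafka_internal}, topic: {id_allocator}} in the metadata cache",
    "WARN 2022-08-04 19:44:23,752 [shard 0] kafka - init_producer_id.cc:114 - failed to allocate pid, ec: cluster::errc:14",
    "ERROR 2022-08-04 19:44:23,772 [shard 1] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)",
]

if __name__ == "__main__":
    caught, uncaught = classify_wait_aborted(SAMPLE)
    print(f"caught: {len(caught)}, uncaught: {len(uncaught)}")
```

On the sample, the id_allocator_frontend.cc:252 WARN lands in the caught list and the shard 1 rpc ERROR lands in the uncaught list; the other two WARN lines are skipped because they do not mention the exception.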
FAIL test: PartitionBalancerTest.test_fuzz_admin_ops (2/37 runs) failure at 2022-08-05T07:48:34.288Z: <BadLogLines nodes=docker-rp-8(1) example="ERROR 2022-08-05 06:32:44,034 [shard 0] rpc - Service handler threw an exception: raft::offset_monitor::wait_aborted (offset monitor wait aborted)"> in job https://buildkite.com/redpanda/redpanda/builds/13659#01826c88-355c-4b07-a514-c884579adabb
Relevant discussion about the wait_aborted exception: https://github.com/redpanda-data/redpanda/pull/6367#discussion_r971280515