Failure at `ConsumerGroupTest`.`test_dead_group_recovery`: Error occured replicating group {test-gr-1} cleanup records
with debug-clang-amd64:
FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=False (1/46 runs) failure at 2022-10-20T07:16:44.061Z: <BadLogLines nodes=docker-rp-9(1) example="ERROR 2022-10-20 06:10:18,187 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/16958#0183f3c7-0a62-4cba-add4-77b6cc582531
and
FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=True (1/46 runs) failure at 2022-10-21T07:16:21.396Z: <BadLogLines nodes=docker-rp-17(1) example="ERROR 2022-10-21 06:11:18,751 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17030#0183f8ee-c891-4730-b835-319e434b4bd3
https://buildkite.com/redpanda/redpanda/builds/17321#018413bb-b448-4c11-a2bd-d995e528b5d3/6-1327
https://buildkite.com/redpanda/redpanda/builds/17346#01841514-44c3-43fe-9149-5ca413cb27b3/6-1299
A couple more failures with the same failure mode:
FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=False (2/36 runs) failure at 2022-10-28T13:14:32.221Z: <BadLogLines nodes=docker-rp-15(1) example="ERROR 2022-10-28 12:09:33,987 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17516#01841e3c-a9c7-4689-911c-4d39a0df2944 failure at 2022-10-27T19:32:22.027Z: <BadLogLines nodes=docker-rp-14(1) example="ERROR 2022-10-27 18:26:36,411 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17464#01841a73-7c5a-4ee0-92f2-a1bba1376b98
FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=False (2/52 runs) failure at 2022-10-28T17:48:25.786Z: <BadLogLines nodes=docker-rp-21(1) example="ERROR 2022-10-28 16:40:30,261 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17537#01841f34-d39d-4994-bb74-b1e3165ab386
- -240ms: test deletes topic and waits for the CG to become `group_state::dead` (here)
- -140ms: `[N:test-gr-1 S:Empty G:2] group.cc:179 - Changing state from Empty to Dead`
- -40ms: test sees `test-gr-1` is `dead`
- -30ms: test kills all nodes to restart them
- -0ms: `Error occured replicating group test-gr-1 cleanup records: not_enough_replicas` (messages are rejected since there are fewer in-sync replicas than required)
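A rough reconstruction of the test flow that opens this race window, as a sketch; `trigger_dead_group_race`, `_group_state`, and the delete/restart helpers are illustrative stand-ins rather than the actual rptest code, and only ducktape's `wait_until` is a real import:

```python
from ducktape.utils.util import wait_until

def trigger_dead_group_race(self):
    # Deleting the topic makes the CG coordinator transition the group to
    # dead and then (asynchronously) replicate the cleanup tombstones.
    self.client().delete_topic(self.topic)

    # The dead state is observable before the tombstone replication has
    # completed, so this wait can be satisfied inside the race window.
    wait_until(lambda: self._group_state("test-gr-1") == "Dead",
               timeout_sec=30,
               backoff_sec=1,
               err_msg="group test-gr-1 did not become dead")

    # Restarting all nodes at this point can leave the coordinator without
    # enough in-sync replicas for the still-pending tombstone write, which
    # is exactly the not_enough_replicas error above.
    self.redpanda.restart_nodes(self.redpanda.nodes)
```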
The issue is that the test kills the other two nodes of the cluster before the CG coordinator is able to replicate the tombstones for `__consumer_offsets` that clean up the offsets of the topic being deleted. That happens because `group::remove_topic_partitions` sets the CG to `dead` first and only then replicates the tombstones; the test may begin killing nodes as soon as the `dead` status is set on the CG.
There is no actual issue in RP: the tombstones are appended to the log on the leader, and after the cluster is restarted the CG is still `dead`. There is no way for the test to wait for the replication to complete (that is not exposed), so the best way to resolve this is to ignore this log message in this test.
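For instance, a minimal sketch of that allow-listing, assuming rptest's `@cluster` decorator and its `log_allow_list` parameter; the node count, pattern name, and decorator arguments here are illustrative, not the actual test:

```python
from ducktape.mark import matrix
from rptest.services.cluster import cluster
from rptest.tests.redpanda_test import RedpandaTest

# Benign when the cluster is restarted before the cleanup tombstones
# finish replicating; see the timeline above. "occured" matches the
# actual (misspelled) log message.
DEAD_GROUP_CLEANUP_LOG_ALLOW_LIST = [
    r"Error occured replicating group .* cleanup records"
]

class ConsumerGroupTest(RedpandaTest):
    @cluster(num_nodes=6, log_allow_list=DEAD_GROUP_CLEANUP_LOG_ALLOW_LIST)
    @matrix(static_members=[False, True])
    def test_dead_group_recovery(self, static_members):
        ...  # test body unchanged
```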
Seen in #7014
FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=True (1/29 runs) failure at 2022-11-03T07:37:41.574Z: <BadLogLines nodes=docker-rp-21(1) example="ERROR 2022-11-03 06:34:36,629 [shard 1] kafka - group.cc:2806 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17827#01843bef-b5c0-4258-b4bc-98e6476f9b6a
@dlex were you going to open that pull request?
If this is really a condition that is not harmful, we probably should log it as WARN rather than ERROR. If it happens in the field, we don't want our users to think something is broken.
Another instance https://buildkite.com/redpanda/redpanda/builds/17963#01844b59-8cbf-4dfd-b6a6-fa08f944db5e/6-1080
Came back post-fix: https://buildkite.com/redpanda/redpanda/builds/18457#018465eb-b508-49e1-a86a-23b9a6a3508c/6-580
This one in debug-clang-amd64: https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5bb-49ae-8f4c-ba33c370a696
Another instance:
https://buildkite.com/redpanda/redpanda/builds/18773#018484fb-b9ad-4b5d-b527-3323bdb96254
Maybe we should reopen, given the comments about hitting this again.
Another one in a feature branch: https://buildkite.com/redpanda/redpanda/builds/18931#0184a033-7ea6-4795-9286-d0999d69a495
Same issue is reported via another stack: what was fixed was in `group::remove()`; now the error is logged in `group::remove_topic_partitions()`.
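Since the same message can now come from either call site, an allow-list pattern anchored on the message text rather than the emitting source line would cover both; a quick illustrative check against the lines quoted in this thread:

```python
import re

# Matches the cleanup-records error regardless of which group.cc method
# (group::remove() or group::remove_topic_partitions()) logged it.
DEAD_GROUP_CLEANUP_ERROR = re.compile(
    r"Error occured replicating group .* cleanup records")

for line in [
        "kafka - group.cc:2601 - Error occured replicating group "
        "{test-gr-1} cleanup records raft::errc:19",
        "kafka - group.cc:2806 - Error occured replicating group "
        "{test-gr-1} cleanup records raft::errc:19",
]:
    assert DEAD_GROUP_CLEANUP_ERROR.search(line)
```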