redpanda Failure at `ConsumerGroupTest`.`test_dead_group_recovery`: Error occured replicating group {test-gr-1} cleanup records

with debug-clang-amd64:

FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=False (1/46 runs) failure at 2022-10-20T07:16:44.061Z: <BadLogLines nodes=docker-rp-9(1) example="ERROR 2022-10-20 06:10:18,187 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/16958#0183f3c7-0a62-4cba-add4-77b6cc582531

and

FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=True (1/46 runs) failure at 2022-10-21T07:16:21.396Z: <BadLogLines nodes=docker-rp-17(1) example="ERROR 2022-10-21 06:11:18,751 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17030#0183f8ee-c891-4730-b835-319e434b4bd3

Oct 21 '22 14:10 andijcr

https://buildkite.com/redpanda/redpanda/builds/17321#018413bb-b448-4c11-a2bd-d995e528b5d3/6-1327

Oct 26 '22 15:10 twmb

https://buildkite.com/redpanda/redpanda/builds/17346#01841514-44c3-43fe-9149-5ca413cb27b3/6-1299

Oct 26 '22 19:10 r-vasquez

Another couple failures with the same failure mode:

FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=False (2/36 runs) failure at 2022-10-28T13:14:32.221Z: <BadLogLines nodes=docker-rp-15(1) example="ERROR 2022-10-28 12:09:33,987 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17516#01841e3c-a9c7-4689-911c-4d39a0df2944 failure at 2022-10-27T19:32:22.027Z: <BadLogLines nodes=docker-rp-14(1) example="ERROR 2022-10-27 18:26:36,411 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17464#01841a73-7c5a-4ee0-92f2-a1bba1376b98

Oct 28 '22 16:10 VladLazar

FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=False (2/52 runs) failure at 2022-10-28T17:48:25.786Z: <BadLogLines nodes=docker-rp-21(1) example="ERROR 2022-10-28 16:40:30,261 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17537#01841f34-d39d-4994-bb74-b1e3165ab386

Oct 31 '22 18:10 dlex

-240ms: test deletes topic and waits for the CG to become group_state::dead (here)
-140ms: [N:test-gr-1 S:Empty G:2] group.cc:179 - Changing state from Empty to Dead
-40ms: test sees test-gr-1 is dead
-30ms: test kills all nodes to restart them
-0ms: Error occured replicating group test-gr-1 cleanup records: not_enough_replicas (Messages are rejected since there are fewer in-sync replicas than required.)

Nov 02 '22 06:11 dlex

The issue is that the test kills the other two nodes of the cluster before the CG coordinator is able to replicate the tombstones for __consumer_offsets to clean up the offset of the topic being deleted. That happens because group::remove_topic_partitions sets the CG to dead first, and only then replicates the tombstones; the test may begin killing node members as soon as dead status is set on the CG.

There is no issue in RP, the tombstones are appended to the log on the leader; after the cluster is restarted, the CG is still dead. There is no way for the test to wait for the replication to complete (that is not exposed), therefore the best way to resolve is to ignore this log message in this test.

Nov 02 '22 22:11 dlex

Seen in #7014

Nov 03 '22 02:11 twmb

FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=True (1/29 runs) failure at 2022-11-03T07:37:41.574Z: <BadLogLines nodes=docker-rp-21(1) example="ERROR 2022-11-03 06:34:36,629 [shard 1] kafka - group.cc:2806 - Error occured replicating group {test-g r-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17827#01843bef-b5c0-4258-b4bc-98e6476f9b6a

Nov 03 '22 19:11 VladLazar

@dlex were you going to open that pull request?

If this is really a condition that is not harmful, we probably should log it as WARN rather than ERROR. If it happens in the field, we don't want our users to think something is broken.

Nov 04 '22 18:11 jcsp

Another instance https://buildkite.com/redpanda/redpanda/builds/17963#01844b59-8cbf-4dfd-b6a6-fa08f944db5e/6-1080

Nov 07 '22 20:11 NyaliaLui

Came back post fix https://buildkite.com/redpanda/redpanda/builds/18457#018465eb-b508-49e1-a86a-23b9a6a3508c/6-580

Nov 11 '22 15:11 NyaliaLui

this in debug-clang-amd64 https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5bb-49ae-8f4c-ba33c370a696

Nov 15 '22 15:11 andijcr

Another instance:

https://buildkite.com/redpanda/redpanda/builds/18773#018484fb-b9ad-4b5d-b527-3323bdb96254

Maybe we should reopen.

Nov 17 '22 22:11 travisdowns

Given the comments about hitting this again.

Nov 22 '22 04:11 piyushredpanda

another one in a feature branch: https://buildkite.com/redpanda/redpanda/builds/18931#0184a033-7ea6-4795-9286-d0999d69a495

Nov 22 '22 19:11 dlex

Same issue is reported via another stack: what was fixed was in group::remove(), now the error is logged in group::remove_topic_partitions().

Nov 22 '22 23:11 dlex

redpanda redpanda copied to clipboard

Failure at `ConsumerGroupTest`.`test_dead_group_recovery`: Error occured replicating group {test-gr-1} cleanup records

redpanda
redpanda copied to clipboard