redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

Failure at `ConsumerGroupTest`.`test_dead_group_recovery`: Error occured replicating group {test-gr-1} cleanup records

Open andijcr opened this issue 2 years ago • 14 comments

with debug-clang-amd64:

FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=False (1/46 runs) failure at 2022-10-20T07:16:44.061Z: <BadLogLines nodes=docker-rp-9(1) example="ERROR 2022-10-20 06:10:18,187 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/16958#0183f3c7-0a62-4cba-add4-77b6cc582531

and

FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=True (1/46 runs) failure at 2022-10-21T07:16:21.396Z: <BadLogLines nodes=docker-rp-17(1) example="ERROR 2022-10-21 06:11:18,751 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17030#0183f8ee-c891-4730-b835-319e434b4bd3

andijcr avatar Oct 21 '22 14:10 andijcr

https://buildkite.com/redpanda/redpanda/builds/17321#018413bb-b448-4c11-a2bd-d995e528b5d3/6-1327

twmb avatar Oct 26 '22 15:10 twmb

https://buildkite.com/redpanda/redpanda/builds/17346#01841514-44c3-43fe-9149-5ca413cb27b3/6-1299

r-vasquez avatar Oct 26 '22 19:10 r-vasquez

Another couple failures with the same failure mode:

FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=False (2/36 runs) failure at 2022-10-28T13:14:32.221Z: <BadLogLines nodes=docker-rp-15(1) example="ERROR 2022-10-28 12:09:33,987 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17516#01841e3c-a9c7-4689-911c-4d39a0df2944 failure at 2022-10-27T19:32:22.027Z: <BadLogLines nodes=docker-rp-14(1) example="ERROR 2022-10-27 18:26:36,411 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17464#01841a73-7c5a-4ee0-92f2-a1bba1376b98

VladLazar avatar Oct 28 '22 16:10 VladLazar

FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=False (2/52 runs) failure at 2022-10-28T17:48:25.786Z: <BadLogLines nodes=docker-rp-21(1) example="ERROR 2022-10-28 16:40:30,261 [shard 1] kafka - group.cc:2601 - Error occured replicating group {test-gr-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17537#01841f34-d39d-4994-bb74-b1e3165ab386

dlex avatar Oct 31 '22 18:10 dlex

  • -240ms: test deletes topic and waits for the CG to become group_state::dead (here)
  • -140ms: [N:test-gr-1 S:Empty G:2] group.cc:179 - Changing state from Empty to Dead
  • -40ms: test sees test-gr-1 is dead
  • -30ms: test kills all nodes to restart them
  • -0ms: Error occured replicating group test-gr-1 cleanup records: not_enough_replicas (Messages are rejected since there are fewer in-sync replicas than required.)

dlex avatar Nov 02 '22 06:11 dlex

The issue is that the test kills the other two nodes of the cluster before the CG coordinator is able to replicate the tombstones for __consumer_offsets to clean up the offset of the topic being deleted. That happens because group::remove_topic_partitions sets the CG to dead first, and only then replicates the tombstones; the test may begin killing node members as soon as dead status is set on the CG.

There is no issue in RP, the tombstones are appended to the log on the leader; after the cluster is restarted, the CG is still dead. There is no way for the test to wait for the replication to complete (that is not exposed), therefore the best way to resolve is to ignore this log message in this test.

dlex avatar Nov 02 '22 22:11 dlex

Seen in #7014

twmb avatar Nov 03 '22 02:11 twmb

FAIL test: ConsumerGroupTest.test_dead_group_recovery.static_members=True (1/29 runs) failure at 2022-11-03T07:37:41.574Z: <BadLogLines nodes=docker-rp-21(1) example="ERROR 2022-11-03 06:34:36,629 [shard 1] kafka - group.cc:2806 - Error occured replicating group {test-g r-1} cleanup records raft::errc:19"> in job https://buildkite.com/redpanda/redpanda/builds/17827#01843bef-b5c0-4258-b4bc-98e6476f9b6a

VladLazar avatar Nov 03 '22 19:11 VladLazar

@dlex were you going to open that pull request?

If this is really a condition that is not harmful, we probably should log it as WARN rather than ERROR. If it happens in the field, we don't want our users to think something is broken.

jcsp avatar Nov 04 '22 18:11 jcsp

Another instance https://buildkite.com/redpanda/redpanda/builds/17963#01844b59-8cbf-4dfd-b6a6-fa08f944db5e/6-1080

NyaliaLui avatar Nov 07 '22 20:11 NyaliaLui

Came back post fix https://buildkite.com/redpanda/redpanda/builds/18457#018465eb-b508-49e1-a86a-23b9a6a3508c/6-580

NyaliaLui avatar Nov 11 '22 15:11 NyaliaLui

this in debug-clang-amd64 https://buildkite.com/redpanda/redpanda/builds/18587#018479bc-e5bb-49ae-8f4c-ba33c370a696

andijcr avatar Nov 15 '22 15:11 andijcr

Another instance:

https://buildkite.com/redpanda/redpanda/builds/18773#018484fb-b9ad-4b5d-b527-3323bdb96254

Maybe we should reopen.

travisdowns avatar Nov 17 '22 22:11 travisdowns

Given the comments about hitting this again.

piyushredpanda avatar Nov 22 '22 04:11 piyushredpanda

another one in a feature branch: https://buildkite.com/redpanda/redpanda/builds/18931#0184a033-7ea6-4795-9286-d0999d69a495

dlex avatar Nov 22 '22 19:11 dlex

Same issue is reported via another stack: what was fixed was in group::remove(), now the error is logged in group::remove_topic_partitions().

dlex avatar Nov 22 '22 23:11 dlex