redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

DescribeGroups failed in `ConsumerGroupTest.test_dead_group_recovery.static_members=True`

Open ztlpn opened this issue 3 years ago • 19 comments

https://buildkite.com/redpanda/redpanda/builds/11093#018144a0-4373-4133-b2d2-09624ddc39bd

Module: rptest.tests.consumer_group_test
Class:  ConsumerGroupTest
Method: test_dead_group_recovery
Arguments:
{
  "static_members": true
}
RpkException('command /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-03b8eb9dbfa8ff623-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk group --brokers docker-rp-17:9092,docker-rp-8:9092,docker-rp-7:9092 describe test-gr-1 returned 1, output: all 1 DescribeGroups request failures, first error: Kafka replied that group test-gr-1 has broker coordinator 3, but did not reply with that broker in the broker list\n', '')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/consumer_group_test.py", line 360, in test_dead_group_recovery
    wait_until(group_is_dead, 30, 2)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/tests/consumer_group_test.py", line 351, in group_is_dead
    rpk_group = rpk.group_describe(group)
  File "/root/tests/rptest/clients/rpk.py", line 383, in group_describe
    rpk_group = try_describe_group(group)
  File "/root/tests/rptest/clients/rpk.py", line 352, in try_describe_group
    out = self._run_group(cmd)
  File "/root/tests/rptest/clients/rpk.py", line 425, in _run_group
    return self._execute(cmd, stdin=stdin, timeout=timeout)
  File "/root/tests/rptest/clients/rpk.py", line 537, in _execute
    raise RpkException(
rptest.clients.rpk.RpkException: RpkException<command /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-03b8eb9dbfa8ff623-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk group --brokers docker-rp-17:9092,docker-rp-8:9092,docker-rp-7:9092 describe test-gr-1 returned 1, output: all 1 DescribeGroups request failures, first error: Kafka replied that group test-gr-1 has broker coordinator 3, but did not reply with that broker in the broker list
>

ztlpn avatar Jun 09 '22 12:06 ztlpn

Another instance: https://buildkite.com/redpanda/redpanda/builds/11148#01814a60-009d-4b49-9449-7ca334cb8379 (static_members=False)

bharathv avatar Jun 09 '22 23:06 bharathv

https://buildkite.com/redpanda/redpanda/builds/11114#0181472b-a230-438c-8743-52f2cd4ec388

ztlpn avatar Jun 10 '22 13:06 ztlpn

https://github.com/redpanda-data/redpanda/pull/5094

https://buildkite.com/redpanda/redpanda/builds/11194#01814eb5-a13a-486f-a826-07e518defd78/12935-12937

dotnwat avatar Jun 11 '22 21:06 dotnwat

Also with static_members=False

From PandaResults:

Unstable test: ConsumerGroupTest.test_dead_group_recovery.static_members=False (1/66 runs failed)
  Latest failure at 2022-06-19T07:32:38.146Z: RpkException('command /var/lib/buildkite-agent/builds/arm64-xfs-builders-i-0135eef11f8192030-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk group --brokers docker-rp-17:9092,docker-rp-21:9092,docker-rp-16:9092 describe test-gr-1 returned 1, output: all 1 DescribeGroups request failures, first error: Kafka replied that group test-gr-1 has broker coordinator 3, but did not reply with that broker in the broker list\n', '')
             in job https://buildkite.com/redpanda/redpanda/builds/11472#01817a96-21a4-4c30-be02-e67c6070c656

LenaAn avatar Jun 20 '22 21:06 LenaAn

https://github.com/redpanda-data/redpanda/issues/5275 is a duplicate of this one

dlex avatar Jul 07 '22 22:07 dlex

https://buildkite.com/redpanda/redpanda/builds/12246#0181d92a-37c4-4db7-9bf2-b481930a9515

jcsp avatar Jul 08 '22 12:07 jcsp

another one https://buildkite.com/redpanda/redpanda/builds/12507#0181f814-1f9b-4b8a-aa38-cefd70570d07 from https://github.com/redpanda-data/redpanda/pull/5455

LenaAn avatar Jul 13 '22 20:07 LenaAn

The test is run on 3 nodes. The sequence is: create topic - produce and consume in a group - delete topic - restart all nodes - * - create topic. At the moment * rpk group describe fails with the error.

After the restart of the nodes, there are two outcomes possible.

  1. Successful:
note over node1, node2, node3 : {kafka/__consumer_offsets/0} Recovered persistent state from kvstore: voted for: 2
...
test -> rpk : describe the-group --brokers 1,2,3
rpk -> node1 : FindCoordinator(the-group)
rpk <-- node1 : coordinator is node 2
rpk -> node1 : MetadataRequest
rpk <-- node1 : nodes: {1,2,3}, controller: 1, {kafka/__consumer_offsets/0} leader: 2
rpk -> node3 : MetadataRequest
rpk <-- node3 : nodes: {1,2,3}, controller: 1, {kafka/__consumer_offsets/0} leader: 2
test <-- rpk : coordinator: 2, state: dead

svg

  1. Failure:
note over node1 : {kafka/__consumer_offsets/0} Recovered persistent state from kvstore: voted for: 1
note over node2 : {kafka/__consumer_offsets/0} Recovered persistent state from kvstore: voted for: 2 or 3
note over node3 : {kafka/__consumer_offsets/0} Recovered persistent state from kvstore: voted for: 3
...
node3 -> node2 : {kafka/__consumer_offsets/0} Vote request 
node3 -> node1 : {kafka/__consumer_offsets/0} Vote request
node3 <-- node1 : vote_granted: 0 or 1
node3 <-- node2 : vote_granted: 1
note over node3 : becoming the leader
test -> rpk : describe the-group --brokers 1,2,3
rpk -> node2 : MetadataRequest
rpk <-- node2 : nodes: {1,2}, controller: 1, {kafka/__consumer_offsets/0} leader: 3
rpk -> node1 : FindCoordinator(the-group)
rpk <-- node1 : coordinator is node 3
rpk -> node2 : MetadataRequest
rpk <-- node2 : nodes: {1,2}, controller: 1, {kafka/__consumer_offsets/0} leader: 3
rpk -> node1 : MetadataRequest
rpk <-- node1 : nodes: {1,2,3}, controller: 1, {kafka/__consumer_offsets/0} leader: 3
test <-- rpk : "Kafka replied that group test-gr-1 has broker coordinator 3, but did not reply with that broker in the broker list"

svg

In each case, role of each of the nodes is the same.

In the failure case, node2 does not list node3 because its node_state has is_alive: false.

dlex avatar Jul 16 '22 06:07 dlex

Any thoughts on what the solution is? If it is an issue with test, redpanda, or kafka protocol in general?

dotnwat avatar Jul 17 '22 21:07 dotnwat

I can't see the root cause, but I've experienced issues with the correctness of is_alive in the past (internal Slack link). Might be related somehow. I'll try and run my repro of that again.

VladLazar avatar Jul 18 '22 09:07 VladLazar

@dlex is OOO for a week, fyi.

piyushredpanda avatar Jul 18 '22 12:07 piyushredpanda

+1 https://buildkite.com/redpanda/redpanda/builds/12713#01821520-04f4-4297-b21b-fd8e61c5732b

ZeDRoman avatar Jul 19 '22 10:07 ZeDRoman

The rpk group_describe function is meant to be a retry loop, but wasn't correctly handling the case of the coordinator being unavailable -- I hit the same thing with scale testing, and had an un-merged fix handy. Opened #5610 with it.

I haven't dissected the logs from this failure but I wouldn't be surprised if this became more likely when we recently tweaked the leader balancer to move more partitions at once.

jcsp avatar Jul 25 '22 10:07 jcsp

The rpk group_describe function is meant to be a retry loop, but wasn't correctly handling the case of the coordinator being unavailable -- I hit the same thing with scale testing, and had an un-merged fix handy. Opened #5610 with it.

I haven't dissected the logs from this failure but I wouldn't be surprised if this became more likely when we recently tweaked the leader balancer to move more partitions at once.

I think it is the same issue link

RpkException('command /var/lib/buildkite-agent/builds/arm64-xfs-builders-i-07e7288186208f090-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk group --brokers docker-rp-27:9092,docker-rp-17:9092,docker-rp-28:9092 describe test-gr-1 returned 1, output: all 1 DescribeGroups request failures, first error: COORDINATOR_NOT_AVAILABLE: The coordinator is not available.\n', '')

VadimPlh avatar Jul 25 '22 10:07 VadimPlh

@jcsp, ok let's see if that helps. Meanwhile I'm trying to run the older code with retry in the test itself to verify whether the coordinator reappears in the list in case of the failure.

dlex avatar Jul 25 '22 14:07 dlex

In my local tests, a single retry was enough to recover in all 3 instances where I have reproduced the error.

dlex avatar Jul 25 '22 17:07 dlex

https://buildkite.com/redpanda/redpanda/builds/13690#01826ed3-0a44-4d24-8436-05555f320793/6-6670

This branch (https://github.com/redpanda-data/redpanda/pull/5841) has the PR #5610 above.

twmb avatar Aug 05 '22 21:08 twmb

https://buildkite.com/redpanda/redpanda/builds/14056#01829136-bfdf-46c5-a75c-152ac4e2570d

VladLazar avatar Aug 12 '22 11:08 VladLazar

I think I ended up the assignee here because of my previous change to the describe groups function, but am not actively working on it.

jcsp avatar Aug 31 '22 12:08 jcsp

Seen on v22.2.x https://buildkite.com/redpanda/redpanda/builds/18771#_

BenPope avatar Nov 17 '22 11:11 BenPope