redpanda
                                
                                
                                
                                    redpanda copied to clipboard
                            
                            
                            
                        DescribeGroups failed in `ConsumerGroupTest.test_dead_group_recovery.static_members=True`
https://buildkite.com/redpanda/redpanda/builds/11093#018144a0-4373-4133-b2d2-09624ddc39bd
Module: rptest.tests.consumer_group_test
Class:  ConsumerGroupTest
Method: test_dead_group_recovery
Arguments:
{
  "static_members": true
}
RpkException('command /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-03b8eb9dbfa8ff623-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk group --brokers docker-rp-17:9092,docker-rp-8:9092,docker-rp-7:9092 describe test-gr-1 returned 1, output: all 1 DescribeGroups request failures, first error: Kafka replied that group test-gr-1 has broker coordinator 3, but did not reply with that broker in the broker list\n', '')
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/consumer_group_test.py", line 360, in test_dead_group_recovery
    wait_until(group_is_dead, 30, 2)
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 53, in wait_until
    raise e
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 44, in wait_until
    if condition():
  File "/root/tests/rptest/tests/consumer_group_test.py", line 351, in group_is_dead
    rpk_group = rpk.group_describe(group)
  File "/root/tests/rptest/clients/rpk.py", line 383, in group_describe
    rpk_group = try_describe_group(group)
  File "/root/tests/rptest/clients/rpk.py", line 352, in try_describe_group
    out = self._run_group(cmd)
  File "/root/tests/rptest/clients/rpk.py", line 425, in _run_group
    return self._execute(cmd, stdin=stdin, timeout=timeout)
  File "/root/tests/rptest/clients/rpk.py", line 537, in _execute
    raise RpkException(
rptest.clients.rpk.RpkException: RpkException<command /var/lib/buildkite-agent/builds/buildkite-amd64-xfs-builders-i-03b8eb9dbfa8ff623-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk group --brokers docker-rp-17:9092,docker-rp-8:9092,docker-rp-7:9092 describe test-gr-1 returned 1, output: all 1 DescribeGroups request failures, first error: Kafka replied that group test-gr-1 has broker coordinator 3, but did not reply with that broker in the broker list
>
                                    
                                    
                                    
                                
Another instance: https://buildkite.com/redpanda/redpanda/builds/11148#01814a60-009d-4b49-9449-7ca334cb8379 (static_members=False)
https://buildkite.com/redpanda/redpanda/builds/11114#0181472b-a230-438c-8743-52f2cd4ec388
https://github.com/redpanda-data/redpanda/pull/5094
https://buildkite.com/redpanda/redpanda/builds/11194#01814eb5-a13a-486f-a826-07e518defd78/12935-12937
Also with static_members=False
From PandaResults:
Unstable test: ConsumerGroupTest.test_dead_group_recovery.static_members=False (1/66 runs failed)
  Latest failure at 2022-06-19T07:32:38.146Z: RpkException('command /var/lib/buildkite-agent/builds/arm64-xfs-builders-i-0135eef11f8192030-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk group --brokers docker-rp-17:9092,docker-rp-21:9092,docker-rp-16:9092 describe test-gr-1 returned 1, output: all 1 DescribeGroups request failures, first error: Kafka replied that group test-gr-1 has broker coordinator 3, but did not reply with that broker in the broker list\n', '')
             in job https://buildkite.com/redpanda/redpanda/builds/11472#01817a96-21a4-4c30-be02-e67c6070c656
                                    
                                    
                                    
                                
https://github.com/redpanda-data/redpanda/issues/5275 is a duplicate of this one
https://buildkite.com/redpanda/redpanda/builds/12246#0181d92a-37c4-4db7-9bf2-b481930a9515
another one https://buildkite.com/redpanda/redpanda/builds/12507#0181f814-1f9b-4b8a-aa38-cefd70570d07 from https://github.com/redpanda-data/redpanda/pull/5455
The test is run on 3 nodes. The sequence is: create topic - produce and consume in a group - delete topic - restart all nodes - * - create topic. At the moment * rpk group describe fails with the error.
After the restart of the nodes, there are two outcomes possible.
- Successful:
 
note over node1, node2, node3 : {kafka/__consumer_offsets/0} Recovered persistent state from kvstore: voted for: 2
...
test -> rpk : describe the-group --brokers 1,2,3
rpk -> node1 : FindCoordinator(the-group)
rpk <-- node1 : coordinator is node 2
rpk -> node1 : MetadataRequest
rpk <-- node1 : nodes: {1,2,3}, controller: 1, {kafka/__consumer_offsets/0} leader: 2
rpk -> node3 : MetadataRequest
rpk <-- node3 : nodes: {1,2,3}, controller: 1, {kafka/__consumer_offsets/0} leader: 2
test <-- rpk : coordinator: 2, state: dead
- Failure:
 
note over node1 : {kafka/__consumer_offsets/0} Recovered persistent state from kvstore: voted for: 1
note over node2 : {kafka/__consumer_offsets/0} Recovered persistent state from kvstore: voted for: 2 or 3
note over node3 : {kafka/__consumer_offsets/0} Recovered persistent state from kvstore: voted for: 3
...
node3 -> node2 : {kafka/__consumer_offsets/0} Vote request 
node3 -> node1 : {kafka/__consumer_offsets/0} Vote request
node3 <-- node1 : vote_granted: 0 or 1
node3 <-- node2 : vote_granted: 1
note over node3 : becoming the leader
test -> rpk : describe the-group --brokers 1,2,3
rpk -> node2 : MetadataRequest
rpk <-- node2 : nodes: {1,2}, controller: 1, {kafka/__consumer_offsets/0} leader: 3
rpk -> node1 : FindCoordinator(the-group)
rpk <-- node1 : coordinator is node 3
rpk -> node2 : MetadataRequest
rpk <-- node2 : nodes: {1,2}, controller: 1, {kafka/__consumer_offsets/0} leader: 3
rpk -> node1 : MetadataRequest
rpk <-- node1 : nodes: {1,2,3}, controller: 1, {kafka/__consumer_offsets/0} leader: 3
test <-- rpk : "Kafka replied that group test-gr-1 has broker coordinator 3, but did not reply with that broker in the broker list"
In each case, role of each of the nodes is the same.
In the failure case, node2 does not list node3 because its node_state has is_alive: false.
Any thoughts on what the solution is? If it is an issue with test, redpanda, or kafka protocol in general?
I can't see the root cause, but I've experienced issues with the correctness of is_alive in the past (internal Slack link). Might be related somehow. I'll try and run my repro of that again.
@dlex is OOO for a week, fyi.
+1 https://buildkite.com/redpanda/redpanda/builds/12713#01821520-04f4-4297-b21b-fd8e61c5732b
The rpk group_describe function is meant to be a retry loop, but wasn't correctly handling the case of the coordinator being unavailable -- I hit the same thing with scale testing, and had an un-merged fix handy. Opened #5610 with it.
I haven't dissected the logs from this failure but I wouldn't be surprised if this became more likely when we recently tweaked the leader balancer to move more partitions at once.
The rpk group_describe function is meant to be a retry loop, but wasn't correctly handling the case of the coordinator being unavailable -- I hit the same thing with scale testing, and had an un-merged fix handy. Opened #5610 with it.
I haven't dissected the logs from this failure but I wouldn't be surprised if this became more likely when we recently tweaked the leader balancer to move more partitions at once.
I think it is the same issue link
RpkException('command /var/lib/buildkite-agent/builds/arm64-xfs-builders-i-07e7288186208f090-1/redpanda/redpanda/vbuild/release/clang/dist/local/redpanda/bin/rpk group --brokers docker-rp-27:9092,docker-rp-17:9092,docker-rp-28:9092 describe test-gr-1 returned 1, output: all 1 DescribeGroups request failures, first error: COORDINATOR_NOT_AVAILABLE: The coordinator is not available.\n', '')
                                    
                                    
                                    
                                
@jcsp, ok let's see if that helps. Meanwhile I'm trying to run the older code with retry in the test itself to verify whether the coordinator reappears in the list in case of the failure.
In my local tests, a single retry was enough to recover in all 3 instances where I have reproduced the error.
https://buildkite.com/redpanda/redpanda/builds/13690#01826ed3-0a44-4d24-8436-05555f320793/6-6670
This branch (https://github.com/redpanda-data/redpanda/pull/5841) has the PR #5610 above.
https://buildkite.com/redpanda/redpanda/builds/14056#01829136-bfdf-46c5-a75c-152ac4e2570d
I think I ended up the assignee here because of my previous change to the describe groups function, but am not actively working on it.
Seen on v22.2.x https://buildkite.com/redpanda/redpanda/builds/18771#_