Failure (consumer hang) in `KgoVerifierSelfTest`.`test_kgo_verifier`
https://buildkite.com/redpanda/redpanda/builds/18261#01845dae-d487-4c29-9e51-74b2fc1b0ea8/6-703
Timeout waiting for the consumer to finish. Based on the kgo-verifier log, the consumer is hung inside its poll.
[INFO - 2022-11-09 20:41:36,253 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.services_self_test.KgoVerifierSelfTest.test_kgo_verifier: Summary: TimeoutError(None)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/services_self_test.py", line 124, in test_kgo_verifier
seq_consumer.wait(timeout_sec=60)
Consumer started its last attempt to read at this time:
time="2022-11-09T20:40:26Z" level=info msg="Sequential read end offsets: [54 66 65 54 60 61 59 69 73 54 68 65 63 61 61 67]"
It read 524 of 1000 offsets according to its status, and then got stuck there.
Consumer is asked to /last_pass at this time:
time="2022-11-09T20:40:31Z" level=info msg="Remote request /last_pass"
There were several leadership movements ~20 seconds after the cluster started, initiated by the leader balancer:
docker-rp-15/redpanda.log:INFO 2022-11-09 20:40:07,680 [shard 0] raft - [group_id:9, {kafka/test/8}] vote_stm.cc:280 - became the leader term: 2
docker-rp-1/redpanda.log:INFO 2022-11-09 20:40:06,858 [shard 1] raft - [group_id:2, {kafka/test/1}] vote_stm.cc:280 - became the leader term: 2
docker-rp-2/redpanda.log:INFO 2022-11-09 20:40:05,959 [shard 1] raft - [group_id:1, {kafka/test/0}] vote_stm.cc:280 - became the leader term: 2
docker-rp-2/redpanda.log:INFO 2022-11-09 20:40:07,053 [shard 1] raft - [group_id:4, {kafka/test/3}] vote_stm.cc:280 - became the leader term: 2
The workload constructs a new kgo.Client at this time:
time="2022-11-09T20:40:26Z" level=info msg="Sequential read end offsets: [54 66 65 54 60 61 59 69 73 54 68 65 63 61 61 67]"
I see what I think is its first request to the first node landing here:
TRACE 2022-11-09 20:40:26,645 [shard 1] kafka - fetch.cc:551 - handling fetch request: {replica_id=-1 max_wait_ms=5000 min_bytes=1 max_bytes=52428800 isolation_level=read_uncommitted session_id=0 session_epoch=0 topics={{name={test} fetch_partitions={{partition_index=1 current_leader_epoch=2 fetch_offset=0 log_start_offset=-1 max_bytes=1048576}, {partition_index=12 current_leader_epoch=1 fetch_offset=0 log_start_offset=-1 max_bytes=1048576}, {partition_index=14 current_leader_epoch=1 fetch_offset=0 log_start_offset=-1 max_bytes=1048576}, {partition_index=15 current_leader_epoch=1 fetch_offset=0 log_start_offset=-1 max_bytes=1048576}}}} forgotten={} rack_id=}
The server is sending full fetch responses to the client from each node, and the client is coming back and doing empty fetch polls afterwards; on the face of it, this looks like a possible franz-go bug.
We are clearly returning from PollFetches at least once because the consumer status is updating with a message count (524 higher than at the end of the last sequential scan). That message count does not correspond to the sum of the messages in any of the individual fetch responses, so we aren't getting called back when just one of the responses is in.
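For reference, the verifier's sequential read pass is essentially a loop around PollFetches that records the last offset seen per partition and finishes once every partition has reached its end offset. A minimal sketch of that pattern with franz-go follows; the names (`readToEnd`, `endOffsets`, `lastRead`) and the consume options are illustrative, not the actual kgo-verifier code:

```go
package sketch

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

// readToEnd polls until every partition has been read up to its end offset,
// the pattern the sequential pass follows. endOffsets maps partition ->
// exclusive end offset (the "Sequential read end offsets" values).
func readToEnd(ctx context.Context, brokers []string, topic string, endOffsets map[int32]int64) error {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers(brokers...),
		kgo.ConsumeTopics(topic),
		kgo.ConsumeResetOffset(kgo.NewOffset().AtStart()),
	)
	if err != nil {
		return err
	}
	defer cl.Close()

	lastRead := make(map[int32]int64, len(endOffsets)) // partition -> last offset consumed
	for p := range endOffsets {
		lastRead[p] = -1
	}

	for {
		log.Printf("Calling PollFetches (last_read=%v)", lastRead)
		fetches := cl.PollFetches(ctx) // the hung consumer never returns from here
		if errs := fetches.Errors(); len(errs) > 0 {
			return errs[0].Err
		}
		log.Printf("PollFetches returned %d fetches", len(fetches))

		fetches.EachRecord(func(r *kgo.Record) {
			lastRead[r.Partition] = r.Offset
		})

		// Finish once every partition has reached its end offset.
		done := true
		for p, end := range endOffsets {
			if lastRead[p] < end-1 {
				done = false
				break
			}
		}
		if done {
			return nil
		}
	}
}
```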
Switching on debug logs for the clients in this test here: https://github.com/redpanda-data/redpanda/pull/7233
That PR also updates franz-go to the latest version, in case there was a client bug that has already been fixed.
Once https://github.com/redpanda-data/kgo-verifier/pull/14 is merged, there will be a PR to take it up in the redpanda tests.
That will be it for the moment: there is nothing in the logs that makes this look like a Redpanda bug, and we can't say more about what's going on in the client without a reproduction with more debug logging.
This reproduced here: https://buildkite.com/redpanda/redpanda/builds/18567#0184773a-2bce-4e1f-929c-ced32af30a9c
This increasingly looks like it could be a franz-go bug.
In that last failure, I can see:
- The "Calling PollFetches" log on the client shows all the consumed offsets at zero for partitions on docker-rp-6
- While other nodes have served up their data and continue to receive fetch requests from the client as the test waits for it to time out, docker-rp-6 receives just a few initial fetches and then the client stops sending.
- docker-rp-6 was the destination of two leadership movements, so it has a couple of partitions that have leader_epoch=2 in metadata responses, although this was already the case across multiple earlier successful reads from the client.
Clearly it's surprising/interesting that this issue is coming up in the tiny smoke test rather than elsewhere (e.g. in scale tests). I wonder if it's because in this case the client is ordinarily able to fetch the entire topic in a single request, whereas tests with more data have more chances to get past some issue.
Let's see if this will repro with trace logging https://github.com/redpanda-data/redpanda/pull/7285
Time to push on getting #7285 going so we can debug this one with logs, @jcsp?
#7285 ran the test 300 times without failing.
We have nothing to go on, and franz-go has been upgraded since the last time this happened, so closing this for now: if it reproduces we may get more information.
This reproduced in #8389 https://ci-artifacts.dev.vectorized.cloud/redpanda/0185e491-1357-40f9-b7cb-05b18d5b8221/vbuild/ducktape/results/2023-01-24--001/report.html
So, on this instance: https://ci-artifacts.dev.vectorized.cloud/redpanda/0185e491-1357-40f9-b7cb-05b18d5b8221/vbuild/ducktape/results/2023-01-24--001/report.html
It is specifically partitions on docker-rp-6 that the client is not seeing data for.
Grepping server-side for requests and connections over the last few seconds before the client log goes silent:
TRACE 2023-01-24 16:47:27,586 [shard 1] kafka - server.cc:234 - kafka rpc protocol - Incoming connection from 172.16.16.20:50022 on "dnslistener"
TRACE 2023-01-24 16:47:27,622 [shard 1] kafka - requests.cc:92 - [172.16.16.20:50022] processing name:api_versions, key:18, version:3 for KgoVerifierSeqConsumer-0-140499203778032, mem_units: 8126
TRACE 2023-01-24 16:47:27,662 [shard 1] kafka - requests.cc:92 - [172.16.16.20:50022] processing name:list_offsets, key:2, version:4 for KgoVerifierSeqConsumer-0-140499203778032, mem_units: 8330
TRACE 2023-01-24 16:47:27,919 [shard 0] kafka - server.cc:234 - kafka rpc protocol - Incoming connection from 172.16.16.20:50026 on "dnslistener"
TRACE 2023-01-24 16:47:27,934 [shard 0] kafka - requests.cc:92 - [172.16.16.20:50026] processing name:api_versions, key:18, version:3 for KgoVerifierSeqConsumer-0-140499203778032, mem_units: 8126
TRACE 2023-01-24 16:47:27,941 [shard 0] kafka - requests.cc:92 - [172.16.16.20:50026] processing name:fetch, key:1, version:11 for KgoVerifierSeqConsumer-0-140499203778032, mem_units: 8526
TRACE 2023-01-24 16:47:28,053 [shard 0] kafka - requests.cc:92 - [172.16.16.20:50026] processing name:fetch, key:1, version:11 for KgoVerifierSeqConsumer-0-140499203778032, mem_units: 8526
TRACE 2023-01-24 16:47:28,159 [shard 1] kafka - server.cc:234 - kafka rpc protocol - Incoming connection from 172.16.16.20:50038 on "dnslistener"
TRACE 2023-01-24 16:47:28,180 [shard 1] kafka - requests.cc:92 - [172.16.16.20:50038] processing name:api_versions, key:18, version:3 for KgoVerifierSeqConsumer-0-140499203778032, mem_units: 8126
TRACE 2023-01-24 16:47:28,204 [shard 1] kafka - requests.cc:92 - [172.16.16.20:50038] processing name:list_offsets, key:2, version:4 for KgoVerifierSeqConsumer-0-140499203778032, mem_units: 8330
The general cycle of the client workload is to open 2 connections, one of which issues a list_offsets and the other of which issues a fetch. We can see at the end that we're getting a list_offsets then no fetch. It's like the client just isn't trying to load the data.
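For context, that cycle is consistent with the verifier resolving end offsets for the pass (which produces the list_offsets) and then starting to consume (which produces the fetch). A rough sketch of that two-step cycle, using franz-go's kadm admin client purely for illustration (kgo-verifier may issue the list_offsets differently):

```go
package sketch

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kadm"
	"github.com/twmb/franz-go/pkg/kgo"
)

// endOffsetsThenFetch shows the two request types the server sees per cycle:
// a list_offsets (from resolving end offsets) followed by fetches (from the
// consumer's first poll). cl is assumed to already be configured with
// kgo.ConsumeTopics(topic).
func endOffsetsThenFetch(ctx context.Context, cl *kgo.Client, topic string) (map[int32]int64, error) {
	// Step 1: list_offsets - resolve the high watermark of each partition.
	listed, err := kadm.NewClient(cl).ListEndOffsets(ctx, topic)
	if err != nil {
		return nil, err
	}
	ends := make(map[int32]int64)
	for _, partitions := range listed { // ListedOffsets is topic -> partition -> offset
		for p, lo := range partitions {
			if lo.Err != nil {
				return nil, lo.Err
			}
			ends[p] = lo.Offset
		}
	}
	log.Printf("Sequential read end offsets: %v", ends)

	// Step 2: fetch - the first PollFetches sends fetch requests to each
	// partition leader. In the hung runs this is the step that stops
	// reaching docker-rp-6 after the initial fetches.
	fetches := cl.PollFetches(ctx)
	if errs := fetches.Errors(); len(errs) > 0 {
		return nil, errs[0].Err
	}
	return ends, nil
}
```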
In the client log, grepping for the lines that indicate the client being created and polls progressing, I see:
# Client instance A: successful
time="2023-01-24T16:47:27Z" level=info msg="Starting sequential read pass"
time="2023-01-24T16:47:27Z" level=debug msg="Calling PollFetches (last_read=[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] status {\"name\":\"\",\"valid_reads\":55000,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
time="2023-01-24T16:47:28Z" level=debug msg="PollFetches returned 1 fetches"
time="2023-01-24T16:47:28Z" level=debug msg="Calling PollFetches (last_read=[53 65 64 53 0 0 0 0 72 0 0 0 0 0 0 0] status {\"name\":\"\",\"valid_reads\":55312,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
time="2023-01-24T16:47:28Z" level=debug msg="PollFetches returned 1 fetches"
time="2023-01-24T16:47:28Z" level=debug msg="Calling PollFetches (last_read=[53 65 64 53 59 60 58 0 72 53 67 0 0 0 0 66] status {\"name\":\"\",\"valid_reads\":55681,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
time="2023-01-24T16:47:28Z" level=debug msg="PollFetches returned 1 fetches"
# Client instance B: gets hung up in PollFetches
time="2023-01-24T16:47:28Z" level=info msg="Starting sequential read pass"
time="2023-01-24T16:47:28Z" level=debug msg="Calling PollFetches (last_read=[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] status {\"name\":\"\",\"valid_reads\":56000,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
time="2023-01-24T16:47:28Z" level=debug msg="PollFetches returned 1 fetches"
time="2023-01-24T16:47:28Z" level=debug msg="Calling PollFetches (last_read=[53 65 64 53 0 0 0 0 72 0 0 0 0 0 0 0] status {\"name\":\"\",\"valid_reads\":56312,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
time="2023-01-24T16:47:28Z" level=debug msg="PollFetches returned 1 fetches"
time="2023-01-24T16:47:28Z" level=debug msg="Calling PollFetches (last_read=[53 65 64 53 0 0 0 68 72 0 0 64 62 60 60 0] status {\"name\":\"\",\"valid_reads\":56631,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
So that second client appears to be the one that is sending a list_offsets to the server, then not sending a fetch.
This is presumed to be the same as or similar to https://github.com/redpanda-data/redpanda/issues/6399
Assigning myself, as the next action here is likely to be watching and waiting for a reproduction with https://github.com/redpanda-data/redpanda/pull/8460
Travis also had a good suggestion: add a background goroutine in kgo-verifier that panics the process (to get backtraces) if the client does not complete within the expected timeout, perhaps by passing the expected timeout into /last_pass.
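A rough sketch of what such a watchdog could look like (a hypothetical helper, not something kgo-verifier has today): dump all goroutine stacks and then panic, so a consumer stuck inside PollFetches shows up in the backtrace.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"time"
)

// startWatchdog panics the process if done is not closed within timeout,
// dumping all goroutine stacks first so a consumer blocked in PollFetches is
// visible in the output. The timeout could be passed in via /last_pass.
func startWatchdog(timeout time.Duration, done <-chan struct{}) {
	go func() {
		select {
		case <-done:
			return // consumer finished in time
		case <-time.After(timeout):
			fmt.Fprintf(os.Stderr, "watchdog: consumer did not finish within %s\n", timeout)
			// debug=2 prints every goroutine's stack, including blocked ones.
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
			panic("watchdog timeout")
		}
	}()
}

func main() {
	done := make(chan struct{})
	startWatchdog(90*time.Second, done)
	// ... run the sequential read pass, then:
	close(done)
}
```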
sev/low because it's believed to be a client bug.
Another instance https://buildkite.com/redpanda/redpanda/builds/28840#01880220-7811-4922-a079-73f67ec95994
- on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/29136#01881e02-527f-44d1-9f09-c44f85f2f783
- on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/29102#018813b6-094b-45fd-a18b-a03cd3fb0593
- on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29098#018813bd-c680-4458-bb79-762b0e432b93
This should be gone with the revert of the fetch changes: https://github.com/redpanda-data/redpanda/pull/10757