Failure (consumer hang) in `KgoVerifierSelfTest`.`test_kgo_verifier`
https://buildkite.com/redpanda/redpanda/builds/18261#01845dae-d487-4c29-9e51-74b2fc1b0ea8/6-703
Timeout waiting for the consumer to finish. Based on the kgo-verifier log, the consumer is hung inside its poll.
[INFO - 2022-11-09 20:41:36,253 - runner_client - log - lineno:278]: RunnerClient: rptest.tests.services_self_test.KgoVerifierSelfTest.test_kgo_verifier: Summary: TimeoutError(None)
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/services_self_test.py", line 124, in test_kgo_verifier
seq_consumer.wait(timeout_sec=60)
Consumer started its last attempt to read at this time:
time="2022-11-09T20:40:26Z" level=info msg="Sequential read end offsets: [54 66 65 54 60 61 59 69 73 54 68 65 63 61 61 67]"
It read 524 of 1000 offsets according to its status, and then got stuck there.
Consumer is asked to /last_pass at this time:
time="2022-11-09T20:40:31Z" level=info msg="Remote request /last_pass"
There were several leadership movements ~20 seconds after the cluster started, initiated by the leader balancer:
docker-rp-15/redpanda.log:INFO 2022-11-09 20:40:07,680 [shard 0] raft - [group_id:9, {kafka/test/8}] vote_stm.cc:280 - became the leader term: 2
docker-rp-1/redpanda.log:INFO 2022-11-09 20:40:06,858 [shard 1] raft - [group_id:2, {kafka/test/1}] vote_stm.cc:280 - became the leader term: 2
docker-rp-2/redpanda.log:INFO 2022-11-09 20:40:05,959 [shard 1] raft - [group_id:1, {kafka/test/0}] vote_stm.cc:280 - became the leader term: 2
docker-rp-2/redpanda.log:INFO 2022-11-09 20:40:07,053 [shard 1] raft - [group_id:4, {kafka/test/3}] vote_stm.cc:280 - became the leader term: 2
The workload constructs a new kgo.Client at this time:
time="2022-11-09T20:40:26Z" level=info msg="Sequential read end offsets: [54 66 65 54 60 61 59 69 73 54 68 65 63 61 61 67]"
I see what I think is its first request to the first node landing here:
TRACE 2022-11-09 20:40:26,645 [shard 1] kafka - fetch.cc:551 - handling fetch request: {replica_id=-1 max_wait_ms=5000 min_bytes=1 max_bytes=52428800 isolation_level=read_uncommitted session_id=0 session_epoch=0 topics={{name={test} fetch_partitions={{partition_index=1 current_leader_epoch=2 fetch_offset=0 log_start_offset=-1 max_bytes=1048576}, {partition_index=12 current_leader_epoch=1 fetch_offset=0 log_start_offset=-1 max_bytes=1048576}, {partition_index=14 current_leader_epoch=1 fetch_offset=0 log_start_offset=-1 max_bytes=1048576}, {partition_index=15 current_leader_epoch=1 fetch_offset=0 log_start_offset=-1 max_bytes=1048576}}}} forgotten={} rack_id=}
The server is sending full fetch responses to the client from each node, and the client is coming back and doing empty fetch polls afterwards; on the face of it, this looks like a possible franz-go bug.
We are clearly returning from PollFetches at least once because the consumer status is updating with a message count (524 higher than at the end of the last sequential scan). That message count does not correspond to the sum of the messages in any of the individual fetch responses, so we aren't getting called back when just one of the responses is in.
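For reference, the verifier's sequential read pass is essentially a loop around PollFetches that records the last offset seen per partition and finishes once every partition has reached its end offset. A minimal sketch of that pattern with franz-go follows; the names (`readToEnd`, `endOffsets`, `lastRead`) and the consume options are illustrative, not the actual kgo-verifier code:

```go
package sketch

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kgo"
)

// readToEnd polls until every partition has been read up to its end offset,
// the pattern the sequential pass follows. endOffsets maps partition ->
// exclusive end offset (the "Sequential read end offsets" values).
func readToEnd(ctx context.Context, brokers []string, topic string, endOffsets map[int32]int64) error {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers(brokers...),
		kgo.ConsumeTopics(topic),
		kgo.ConsumeResetOffset(kgo.NewOffset().AtStart()),
	)
	if err != nil {
		return err
	}
	defer cl.Close()

	lastRead := make(map[int32]int64, len(endOffsets)) // partition -> last offset consumed
	for p := range endOffsets {
		lastRead[p] = -1
	}

	for {
		log.Printf("Calling PollFetches (last_read=%v)", lastRead)
		fetches := cl.PollFetches(ctx) // the hung consumer never returns from here
		if errs := fetches.Errors(); len(errs) > 0 {
			return errs[0].Err
		}
		log.Printf("PollFetches returned %d fetches", len(fetches))

		fetches.EachRecord(func(r *kgo.Record) {
			lastRead[r.Partition] = r.Offset
		})

		// Finish once every partition has reached its end offset.
		done := true
		for p, end := range endOffsets {
			if lastRead[p] < end-1 {
				done = false
				break
			}
		}
		if done {
			return nil
		}
	}
}
```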
Switching on debug logs for the clients in this test here: https://github.com/redpanda-data/redpanda/pull/7233
That PR also updates franz-go to the latest version, in case there was a client bug that has already been fixed.
Once https://github.com/redpanda-data/kgo-verifier/pull/14 is merged, there will be a PR to take it up in the redpanda tests.
That will be it for the moment: there is nothing in the logs that makes this look like a Redpanda bug, and we can't say more about what's going on in the client without a reproduction with more debug logging.
This reproduced here: https://buildkite.com/redpanda/redpanda/builds/18567#0184773a-2bce-4e1f-929c-ced32af30a9c
This increasingly looks like it could be a franz-go bug.
In that last failure, I can see:
- The "Calling PollFetches" log on the client shows all the consumed offsets at zero for partitions on docker-rp-6
- While other nodes have served up their data and continue to receive fetch requests from the client as the test waits for it to time out, docker-rp-6 receives just a few initial fetches and then the client stops sending.
- docker-rp-6 was the destination of two leadership movements, so it has a couple of partitions that have leader_epoch=2 in metadata responses, although this was already the case across multiple earlier successful reads from the client.
Clearly it's surprising/interesting that this issue is coming up in the tiny smoke test rather than elsewhere (e.g. in scale tests). I wonder if it's because in this case the client is ordinarily able to fetch the entire topic in a single request, whereas tests with more data have more chances to get past some issue.
Let's see if this will repro with trace logging https://github.com/redpanda-data/redpanda/pull/7285
Time to push on getting #7285 going so we can debug this one with logs, @jcsp?
#7285 ran the test 300 times without failing.
We have nothing to go on, and franz-go has been upgraded since the last time this happened, so closing this for now: if it reproduces we may get more information.
This reproduced in #8389 https://ci-artifacts.dev.vectorized.cloud/redpanda/0185e491-1357-40f9-b7cb-05b18d5b8221/vbuild/ducktape/results/2023-01-24--001/report.html
So, on this instance: https://ci-artifacts.dev.vectorized.cloud/redpanda/0185e491-1357-40f9-b7cb-05b18d5b8221/vbuild/ducktape/results/2023-01-24--001/report.html
It is specifically partitions on docker-rp-6 that the client is not seeing data for.
Grepping server-side for requests and connections over the last few seconds before the client log goes silent:
TRACE 2023-01-24 16:47:27,586 [shard 1] kafka - server.cc:234 - kafka rpc protocol - Incoming connection from 172.16.16.20:50022 on "dnslistener"
TRACE 2023-01-24 16:47:27,622 [shard 1] kafka - requests.cc:92 - [172.16.16.20:50022] processing name:api_versions, key:18, version:3 for KgoVerifierSeqConsumer-0-140499203778032, mem_units: 8126
TRACE 2023-01-24 16:47:27,662 [shard 1] kafka - requests.cc:92 - [172.16.16.20:50022] processing name:list_offsets, key:2, version:4 for KgoVerifierSeqConsumer-0-140499203778032, mem_units: 8330
TRACE 2023-01-24 16:47:27,919 [shard 0] kafka - server.cc:234 - kafka rpc protocol - Incoming connection from 172.16.16.20:50026 on "dnslistener"
TRACE 2023-01-24 16:47:27,934 [shard 0] kafka - requests.cc:92 - [172.16.16.20:50026] processing name:api_versions, key:18, version:3 for KgoVerifierSeqConsumer-0-140499203778032, mem_units: 8126
TRACE 2023-01-24 16:47:27,941 [shard 0] kafka - requests.cc:92 - [172.16.16.20:50026] processing name:fetch, key:1, version:11 for KgoVerifierSeqConsumer-0-140499203778032, mem_units: 8526
TRACE 2023-01-24 16:47:28,053 [shard 0] kafka - requests.cc:92 - [172.16.16.20:50026] processing name:fetch, key:1, version:11 for KgoVerifierSeqConsumer-0-140499203778032, mem_units: 8526
TRACE 2023-01-24 16:47:28,159 [shard 1] kafka - server.cc:234 - kafka rpc protocol - Incoming connection from 172.16.16.20:50038 on "dnslistener"
TRACE 2023-01-24 16:47:28,180 [shard 1] kafka - requests.cc:92 - [172.16.16.20:50038] processing name:api_versions, key:18, version:3 for KgoVerifierSeqConsumer-0-140499203778032, mem_units: 8126
TRACE 2023-01-24 16:47:28,204 [shard 1] kafka - requests.cc:92 - [172.16.16.20:50038] processing name:list_offsets, key:2, version:4 for KgoVerifierSeqConsumer-0-140499203778032, mem_units: 8330
The general cycle of the client workload is to open 2 connections, one of which issues a list_offsets and the other of which issues a fetch. We can see at the end that we're getting a list_offsets then no fetch. It's like the client just isn't trying to load the data.
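For context, that cycle is consistent with the verifier resolving end offsets for the pass (which produces the list_offsets) and then starting to consume (which produces the fetch). A rough sketch of that two-step cycle, using franz-go's kadm admin client purely for illustration (kgo-verifier may issue the list_offsets differently):

```go
package sketch

import (
	"context"
	"log"

	"github.com/twmb/franz-go/pkg/kadm"
	"github.com/twmb/franz-go/pkg/kgo"
)

// endOffsetsThenFetch shows the two request types the server sees per cycle:
// a list_offsets (from resolving end offsets) followed by fetches (from the
// consumer's first poll). cl is assumed to already be configured with
// kgo.ConsumeTopics(topic).
func endOffsetsThenFetch(ctx context.Context, cl *kgo.Client, topic string) (map[int32]int64, error) {
	// Step 1: list_offsets - resolve the high watermark of each partition.
	listed, err := kadm.NewClient(cl).ListEndOffsets(ctx, topic)
	if err != nil {
		return nil, err
	}
	ends := make(map[int32]int64)
	for _, partitions := range listed { // ListedOffsets is topic -> partition -> offset
		for p, lo := range partitions {
			if lo.Err != nil {
				return nil, lo.Err
			}
			ends[p] = lo.Offset
		}
	}
	log.Printf("Sequential read end offsets: %v", ends)

	// Step 2: fetch - the first PollFetches sends fetch requests to each
	// partition leader. In the hung runs this is the step that stops
	// reaching docker-rp-6 after the initial fetches.
	fetches := cl.PollFetches(ctx)
	if errs := fetches.Errors(); len(errs) > 0 {
		return nil, errs[0].Err
	}
	return ends, nil
}
```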
In the client log, grepping for the lines that indicate the client being created and polls progressing, I see:
# Client instance A: successful
time="2023-01-24T16:47:27Z" level=info msg="Starting sequential read pass"
time="2023-01-24T16:47:27Z" level=debug msg="Calling PollFetches (last_read=[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] status {\"name\":\"\",\"valid_reads\":55000,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
time="2023-01-24T16:47:28Z" level=debug msg="PollFetches returned 1 fetches"
time="2023-01-24T16:47:28Z" level=debug msg="Calling PollFetches (last_read=[53 65 64 53 0 0 0 0 72 0 0 0 0 0 0 0] status {\"name\":\"\",\"valid_reads\":55312,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
time="2023-01-24T16:47:28Z" level=debug msg="PollFetches returned 1 fetches"
time="2023-01-24T16:47:28Z" level=debug msg="Calling PollFetches (last_read=[53 65 64 53 59 60 58 0 72 53 67 0 0 0 0 66] status {\"name\":\"\",\"valid_reads\":55681,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
time="2023-01-24T16:47:28Z" level=debug msg="PollFetches returned 1 fetches"
# Client instance B: gets hung up in PollFetches
time="2023-01-24T16:47:28Z" level=info msg="Starting sequential read pass"
time="2023-01-24T16:47:28Z" level=debug msg="Calling PollFetches (last_read=[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] status {\"name\":\"\",\"valid_reads\":56000,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
time="2023-01-24T16:47:28Z" level=debug msg="PollFetches returned 1 fetches"
time="2023-01-24T16:47:28Z" level=debug msg="Calling PollFetches (last_read=[53 65 64 53 0 0 0 0 72 0 0 0 0 0 0 0] status {\"name\":\"\",\"valid_reads\":56312,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
time="2023-01-24T16:47:28Z" level=debug msg="PollFetches returned 1 fetches"
time="2023-01-24T16:47:28Z" level=debug msg="Calling PollFetches (last_read=[53 65 64 53 0 0 0 68 72 0 0 64 62 60 60 0] status {\"name\":\"\",\"valid_reads\":56631,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
So that second client appears to be the one that is sending a list_offsets to the server, then not sending a fetch.
This is presumed to be the same as or similar to https://github.com/redpanda-data/redpanda/issues/6399
Assigning myself, as the next action here is likely to be watching and waiting for a reproduction with https://github.com/redpanda-data/redpanda/pull/8460
Travis also had a good suggestion: add a background goroutine in kgo-verifier that panics the process (to get backtraces) if the client does not complete within the expected timeout, perhaps by passing the expected timeout into /last_pass.
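A rough sketch of what such a watchdog could look like (a hypothetical helper, not something kgo-verifier has today): dump all goroutine stacks and then panic, so a consumer stuck inside PollFetches shows up in the backtrace.

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"time"
)

// startWatchdog panics the process if done is not closed within timeout,
// dumping all goroutine stacks first so a consumer blocked in PollFetches is
// visible in the output. The timeout could be passed in via /last_pass.
func startWatchdog(timeout time.Duration, done <-chan struct{}) {
	go func() {
		select {
		case <-done:
			return // consumer finished in time
		case <-time.After(timeout):
			fmt.Fprintf(os.Stderr, "watchdog: consumer did not finish within %s\n", timeout)
			// debug=2 prints every goroutine's stack, including blocked ones.
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
			panic("watchdog timeout")
		}
	}()
}

func main() {
	done := make(chan struct{})
	startWatchdog(90*time.Second, done)
	// ... run the sequential read pass, then:
	close(done)
}
```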
sev/low because it's believed to be a client bug.
Another instance https://buildkite.com/redpanda/redpanda/builds/28840#01880220-7811-4922-a079-73f67ec95994
- on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/29136#01881e02-527f-44d1-9f09-c44f85f2f783
- on (arm64, container) in job https://buildkite.com/redpanda/redpanda/builds/29102#018813b6-094b-45fd-a18b-a03cd3fb0593
- on (amd64, container) in job https://buildkite.com/redpanda/redpanda/builds/29098#018813bd-c680-4458-bb79-762b0e432b93
This should be gone with the revert of the fetch changes: https://github.com/redpanda-data/redpanda/pull/10757