CI Failure in UpgradeBackToBackTest.test_upgrade_with_all_workloads
https://buildkite.com/redpanda/redpanda/builds/15263#0183381a-ce92-4a7a-9d5f-da89c6dbdbab
Module: rptest.tests.upgrade_test
Class: UpgradeBackToBackTest
Method: test_upgrade_with_all_workloads
Arguments:
{
"single_upgrade": false
}
test_id: rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False
status: FAIL
run time: 13 minutes 23.172 seconds
TimeoutError('')
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/upgrade_test.py", line 238, in test_upgrade_with_all_workloads
consumer.wait()
File "/usr/local/lib/python3.10/dist-packages/ducktape/services/service.py", line 261, in wait
if not self.wait_node(node, end - now):
File "/root/tests/rptest/services/kgo_verifier_services.py", line 162, in wait_node
wait_until(lambda: self._status.active is False or self._status_thread.
File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 58, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
kgo_verifier_services instructed a remote service to finish execution with last_pass but the service didn't do it in time
[DEBUG - 2022-09-13 20:06:25,545 - kgo_verifier_services - wait_node - lineno:160]: wait_node KgoVerifierSeqConsumer-0-140483508361280: waiting for worker to complete
[INFO - 2022-09-13 20:06:26,621 - kgo_verifier_services - _ingest_status - lineno:275]: Worker KgoVerifierSeqConsumer-0-140483508361280 status: ConsumerStatus<True, 0, ValidatorStatus<29074346 0 0>>
[INFO - 2022-09-13 20:06:26,622 - kgo_verifier_services - _ingest_status - lineno:275]: Worker KgoVerifierRandomConsumer-0-140483508359072 status: ConsumerStatus<True, 0, ValidatorStatus<6008 0 0>>
[INFO - 2022-09-13 20:06:26,657 - kgo_verifier_services - _ingest_status - lineno:275]: Worker KgoVerifierConsumerGroupConsumer-0-140483512120608 status: ConsumerStatus<False, 0, ValidatorStatus<100000 0 0>>
...
[INFO - 2022-09-13 20:16:12,107 - kgo_verifier_services - _ingest_status - lineno:275]: Worker KgoVerifierRandomConsumer-0-140483508359072 status: ConsumerStatus<True, 0, ValidatorStatus<22246 0 0>>
[INFO - 2022-09-13 20:16:17,111 - kgo_verifier_services - _ingest_status - lineno:275]: Worker KgoVerifierConsumerGroupConsumer-0-140483512120608 status: ConsumerStatus<False, 0, ValidatorStatus<100000 0 0>>
[INFO - 2022-09-13 20:16:17,111 - kgo_verifier_services - _ingest_status - lineno:275]: Worker KgoVerifierRandomConsumer-0-140483508359072 status: ConsumerStatus<True, 0, ValidatorStatus<22385 0 0>>
[INFO - 2022-09-13 20:16:17,112 - kgo_verifier_services - _ingest_status - lineno:275]: Worker KgoVerifierSeqConsumer-0-140483508361280 status: ConsumerStatus<True, 0, ValidatorStatus<29074346 0 0>>
[INFO - 2022-09-13 20:16:22,115 - kgo_verifier_services - _ingest_status - lineno:275]: Worker KgoVerifierConsumerGroupConsumer-0-140483512120608 status: ConsumerStatus<False, 0, ValidatorStatus<100000 0 0>>
[INFO - 2022-09-13 20:16:22,116 - kgo_verifier_services - _ingest_status - lineno:275]: Worker KgoVerifierRandomConsumer-0-140483508359072 status: ConsumerStatus<True, 0, ValidatorStatus<22525 0 0>>
[INFO - 2022-09-13 20:16:22,116 - kgo_verifier_services - _ingest_status - lineno:275]: Worker KgoVerifierSeqConsumer-0-140483508361280 status: ConsumerStatus<True, 0, ValidatorStatus<29074346 0 0>>
[ERROR - 2022-09-13 20:16:26,025 - cluster - wrapped - lineno:37]: Test failed, doing failure checks...
This isn't a bug in the verifiers if it really does just take that long -- the test should use a wait timeout that matches the amount of data it's going to read sequentially.
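For illustration only, a minimal sketch of what a data-scaled wait might look like in the test (the message count and throughput numbers below are assumptions, not values taken from this run); ducktape's Service.wait() accepts an explicit timeout_sec:

# Hypothetical sketch: scale the consumer wait with the amount of data the
# sequential reader has to re-read, rather than relying on the default timeout.
msg_count = 100_000        # assumed: messages written by the producer
msgs_per_sec = 1_000       # assumed: conservative sequential read throughput
timeout_sec = max(600, 3 * msg_count // msgs_per_sec)

consumer.wait(timeout_sec=timeout_sec)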
This smells fishy. The test isn't doing anything that takes remotely close to 10 minutes to finish.
Looking at some other debug test runs, the test typically completes in 3-4 minutes, and the default timeout for a service wait is 10 minutes, so I'm skeptical this is just a matter of changing timeouts.
@jcsp Any way we can get debug logs to show up in the verifier in these tests? The last messages from the sequential reader are:
...
time="2022-09-13T20:05:30Z" level=info msg="Sequential read..."
time="2022-09-13T20:05:30Z" level=info msg="Sequential start offset topic-zpsddrikhc/0 {0 0 -1 0 %!d(bool=false) %!d(bool=false)}..."
{"name":"","valid_reads":0,"invalid_reads":0,"out_of_scope_invalid_reads":0}
time="2022-09-13T20:05:30Z" level=info msg="Starting sequential read pass"
time="2022-09-13T20:05:30Z" level=info msg="Loading offsets for topic topic-zpsddrikhc t=-1..."
time="2022-09-13T20:05:30Z" level=info msg="Sequential read..."
time="2022-09-13T20:05:30Z" level=info msg="Sequential start offset topic-zpsddrikhc/0 {0 0 -1 0 %!d(bool=false) %!d(bool=false)}..."
{"name":"","valid_reads":28900001,"invalid_reads":0,"out_of_scope_invalid_reads":0}
{"name":"","valid_reads":0,"invalid_reads":0,"out_of_scope_invalid_reads":0}
time="2022-09-13T20:05:30Z" level=info msg="Starting sequential read pass"
time="2022-09-13T20:05:30Z" level=info msg="Loading offsets for topic topic-zpsddrikhc t=-1..."
time="2022-09-13T20:05:30Z" level=info msg="Sequential read..."
time="2022-09-13T20:05:30Z" level=info msg="Sequential start offset topic-zpsddrikhc/0 {0 0 -1 0 %!d(bool=false) %!d(bool=false)}..."
time="2022-09-13T20:06:25Z" level=info msg="Remote request /last_pass"
It looks like docker-rp-14 was restarted around 2022-09-13 20:05:30,852, so maybe there's correlation there.
@andrwng kgo-verifier takes two CLI arguments to control logging: -debug makes it output logs from the verifier code (this'll give you messages that show when it's e.g. entering & leaving a consumer poll call), and -trace gets you logs from franz-go that are useful if a client seems to be doing something really unpleasant like hanging inside a consumer. Use both flags to get both sets of logs. There's no neat toggle for it in the test; you'd just edit kgo_verifier_services.py.
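For concreteness, a rough sketch of the kind of edit meant here, assuming kgo_verifier_services.py assembles the kgo-verifier command line into a string (the variable names below are hypothetical, not the actual code):

# Illustrative only: wherever the service builds the kgo-verifier invocation,
# append the two logging flags the verifier understands.
debug_logging = True       # flip on while chasing this failure

if debug_logging:
    cmd += " -debug"       # logs from the verifier code (poll entry/exit, read progress)
    cmd += " -trace"       # franz-go client logs, for hangs inside the consumer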
It's usually also helpful to inspect the server-side Kafka logs to track what operations were sent to the server, although in this case it's hard because the random reader is still running while the test waits for the sequential reader, so the logs are full of reads and it's hard to tell which verifier they're from. It would be good to modify the test to stop the random consumer first, as sketched below.
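A sketch of that suggestion, assuming the random consumer is a ducktape service with the usual stop()/wait() lifecycle and that the test names it rand_consumer (both assumptions):

# Sketch: quiesce the random reader before waiting on the sequential reader, so
# the server-side fetch logs only contain the sequential consumer's traffic.
rand_consumer.stop()
rand_consumer.wait(timeout_sec=60)   # hypothetical timeout
consumer.wait()                      # now the interesting wait is unambiguous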
There's a similar sort of symptom in https://github.com/redpanda-data/redpanda/issues/6411 where it looks like a consumer is hanging.
In https://github.com/redpanda-data/redpanda/issues/6413, a consumer is getting persistent NOT_LEADER_FOR_PARTITION errors while querying offsets -- it's possible that the same thing is happening in the hanging consumers, but not visibly because the consumer is internally spinning rather than surfacing the error.
Another instance of the same error: https://buildkite.com/redpanda/redpanda/builds/15335#0183404e-e575-4bcf-84af-0e828cd0b868
Two more recent instances of the same failure:
- 20.09.2022: https://buildkite.com/redpanda/redpanda/builds/15501#01835958-2e47-47bf-84ea-a38e47b1ee24
- 17.09.2022: https://buildkite.com/redpanda/redpanda/builds/15422#018349db-92b7-49eb-9dfd-86165d7b1cf3
I've added some improvements to kgo-verifier to make debugging this easier:
- https://github.com/redpanda-data/vtools/pull/935
- https://github.com/redpanda-data/redpanda/pull/6475
21.09.2022: https://buildkite.com/redpanda/redpanda/builds/15516#01835a9e-f1e3-41ae-995d-04d26057c00d
I had an interesting observation: when I made a change that accidentally caused redpanda to fail to start[1], it manifested as a hang (buildkite run timeout) in this test.
- [1] I was adding a new configuration property to those set by default in redpanda.py, and that caused old versions to fail to start.
This CI failure was seen in https://buildkite.com/redpanda/redpanda/builds/15988#01838f12-5a11-44f5-9caf-1d794dd98938 for #6565:
test_id: rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False
status: FAIL
run time: 13 minutes 26.553 seconds

TimeoutError('')
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/upgrade_test.py", line 238, in test_upgrade_with_all_workloads
consumer.wait()
File "/usr/local/lib/python3.10/dist-packages/ducktape/services/service.py", line 261, in wait
if not self.wait_node(node, end - now):
File "/root/tests/rptest/services/kgo_verifier_services.py", line 162, in wait_node
wait_until(lambda: self._status.active is False or self._status_thread.
File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 58, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError
https://buildkite.com/redpanda/redpanda/builds/16090#0183a184-2c91-49f1-8d86-c7b9d1db5df7
Since the verbose logging went in, we can see where the consumer is hung:
time="2022-10-04T07:14:14Z" level=debug msg="Read OK (000000.000000000000033223) on p=2 at o=33223"
time="2022-10-04T07:14:14Z" level=debug msg="Calling PollFetches (last_read=[33467 8070 33223] status {\"name\":\"\",\"valid_reads\":1374763,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
time="2022-10-04T07:15:20Z" level=info msg="Remote request /last_pass"
Partitions 0 and 2 show they are reading up to their high watermark, but partition 1's last consumed message is:
time="2022-10-04T07:14:13Z" level=debug msg="Read OK (000000.000000000000008070) on p=1 at o=8070"
So the consumer is stuck in PollFetches waiting for messages for partition 1 that never come for some reason.
I think I see the fetches for the partition right before a node restart:
# The one that gives us the first 8070 messages
TRACE 2022-10-04 07:14:13,347 [shard 1] kafka - fetch.cc:597 - handling fetch request: {replica_id=-1 max_wait_ms=5000 min_bytes=1 max_bytes=52428800 isolation_level=0 session_id=0 session_epoch=0 topics={{name={topic-amyqsmjwgx} fetch_partitions={{partition_index=1 current_leader_epoch=1 fetch_offset=0 log_start_offset=-1 max_bytes=1048576}}}} forgotten={} rack_id=}
The last log line from the node before it's killed is 07:14:13,464.
Immediately after restart, before election, we see a fetch where the client has already advanced to 33308:
TRACE 2022-10-04 07:14:14,984 [shard 0] kafka - fetch.cc:548 - handling fetch request: {replica_id=-1 max_wait_ms=5000 min_bytes=1 max_bytes=52428800 isolation_level=0 session_id=1073745966 session_epoch=0 topics={{name={topic-amyqsmjwgx} fetch_partitions={{partition_index=1 current_leader_epoch=1 fetch_offset=33308 log_start_offset=-1 max_bytes=1048576}}}} forgotten={} rack_id=}
After node 2 (which held leadership for partition 1) restarts and regains leadership, it returns a fenced_leader_epoch error to the client. But at this point the client had apparently already advanced its fetch_offset to the end of the partition, so whatever went wrong here I think went wrong at the moment the node stopped.
TRACE 2022-10-04 07:14:16,697 [shard 0] kafka - request_context.h:154 - [172.16.32.4:34638] sending 1:fetch response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=5 topics={{name={topic-amyqsmjwgx} partitions={{partition_index=1 error_code={ error_code: fenced_leader_epoch [74] } high_watermark=-1 last_stable_offset=-1 log_start_offset=-1 aborted={nullopt} preferred_read_replica=-1 records={{size 0}}}}}}}
I am trying Noah's new ci-repeat hotness to do a repeated run of this test with franz-go debug logs here https://github.com/redpanda-data/redpanda/pull/6620
https://buildkite.com/redpanda/redpanda/builds/16202#0183a927-7e6c-4fa5-b4ae-35d14affac76
Looks like franz-go is stuck in some metadata refresh loop:
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[INFO] metadata update triggered; why: reload offsets from load failure
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] wrote Metadata v7; broker: 3, bytes_written: 77, write_wait: 14.813µs, time_to_write: 28.057µs, err: <nil>
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] read Metadata v7; broker: 3, bytes_read: 318, read_wait: 21.753µs, time_to_read: 143.908µs, err: <nil>
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] metadata refresh has identical topic partition data; topic: topic-cmvjwkwwwf, partition: 0, leader: 3, leader_epoch: 5
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] metadata refresh has identical topic partition data; topic: topic-cmvjwkwwwf, partition: 1, leader: 2, leader_epoch: 6
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] metadata refresh has identical topic partition data; topic: topic-cmvjwkwwwf, partition: 2, leader: 3, leader_epoch: 5
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] offsets to load broker; broker: 3, load: {map[] map[topic-cmvjwkwwwf:map[0:{0 e4 ce5} 2:{0 e4 ce5}]]}
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] handled epoch results; broker: 3, using: map[], reloading: map[topic-cmvjwkwwwf:map[0:{0 0} 2:{0 0}]]
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[INFO] metadata update triggered; why: reload offsets from load failure
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] wrote Metadata v7; broker: 3, bytes_written: 77, write_wait: 17.956µs, time_to_write: 27.153µs, err: <nil>
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] read Metadata v7; broker: 3, bytes_read: 318, read_wait: 29.387µs, time_to_read: 197.018µs, err: <nil>
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] metadata refresh has identical topic partition data; topic: topic-cmvjwkwwwf, partition: 0, leader: 3, leader_epoch: 5
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] metadata refresh has identical topic partition data; topic: topic-cmvjwkwwwf, partition: 1, leader: 2, leader_epoch: 6
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] metadata refresh has identical topic partition data; topic: topic-cmvjwkwwwf, partition: 2, leader: 3, leader_epoch: 5
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] offsets to load broker; broker: 3, load: {map[] map[topic-cmvjwkwwwf:map[0:{0 e4 ce5} 2:{0 e4 ce5}]]}
time="2022-10-05T18:08:19Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] handled epoch results; broker: 3, using: map[], reloading: map[topic-cmvjwkwwwf:map[0:{0 0} 2:{0 0}]]
time="2022-10-05T18:08:20Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] offsets to load broker; broker: 3, load: {map[] map[topic-cmvjwkwwwf:map[0:{0 e4 ce5} 2:{33307 e4 ce5}]]}
time="2022-10-05T18:08:20Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] offsets to load broker; broker: 2, load: {map[] map[topic-cmvjwkwwwf:map[1:{0 e3 ce6}]]}
time="2022-10-05T18:08:20Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] offsets to load broker; broker: 3, load: {map[] map[topic-cmvjwkwwwf:map[0:{31368 e0 ce5} 2:{33307 e4 ce5}]]}
time="2022-10-05T18:08:20Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] offsets to load broker; broker: 3, load: {map[] map[topic-cmvjwkwwwf:map[0:{0 e4 ce5} 2:{33307 e4 ce5}]]}
time="2022-10-05T18:08:20Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] handled epoch results; broker: 3, using: map[], reloading: map[topic-cmvjwkwwwf:map[0:{0 0} 2:{0 0}]]
time="2022-10-05T18:08:20Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] handled epoch results; broker: 3, using: map[], reloading: map[topic-cmvjwkwwwf:map[0:{0 0} 2:{0 0}]]
time="2022-10-05T18:08:20Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] handled epoch results; broker: 2, using: map[], reloading: map[topic-cmvjwkwwwf:map[1:{0 0}]]
time="2022-10-05T18:08:20Z" name=KgoVerifierSeqConsumer-0-139840465521488[DEBUG] handled epoch results; broker: 3, using: map[], reloading: map[topic-cmvjwkwwwf:map[0:{0 0} 2:{0 0}]]
I note that the client hang starts while Redpanda is on version v22.1.7, which does not have the offset_for_leader_epoch fix https://github.com/redpanda-data/redpanda/pull/6400 -- this has been backported to v22.1.x but not yet released. It's unclear whether that bug could result in a hang like this, but it may be a contributor.
The client manages to trip an offset_out_of_range by trying to fetch offset -1 here:
Server sees:
TRACE 2022-10-05 17:54:38,947 [shard 1] kafka - requests.cc:89 - [172.16.16.21:35433] processing name:fetch, key:1, version:11 for KgoVerifierSeqConsumer-0-139840465521488
INFO 2022-10-05 17:54:38,947 [shard 1] kafka - fetch_session_cache.cc:79 - removing fetch session 1073742556
INFO 2022-10-05 17:54:38,947 [shard 1] kafka - fetch_session_cache.cc:107 - fetch session created: 1073742561
TRACE 2022-10-05 17:54:38,947 [shard 1] kafka - fetch.cc:548 - handling fetch request: {replica_id=-1 max_wait_ms=5000 min_bytes=1 max_bytes=52428800 isolation_level=0 session_id=1073742556 session_epoch=0 topics={{name={topic-cmvjwkwwwf} fetch_partitions={{partition_index=1 current_leader_epoch=4 fetch_offset=-1 log_start_offset=-1 max_bytes=1048576}, {partition_index=0 current_leader_epoch=2 fetch_offset=-1 log_start_offset=-1 max_bytes=1048576}}}} forgotten={} rack_id=}
WARN 2022-10-05 17:54:38,947 [shard 1] kafka - fetch.cc:187 - fetch offset out of range for {kafka/topic-cmvjwkwwwf/0}, requested offset: -1, partition start offset: 0, high watermark: 33213
Client sees:
time="2022-10-05T17:54:38Z" name=KgoVerifierSeqConsumer-0-139840465521488[INFO] immediate metadata update triggered; why: fetch had inner topic errors: OFFSET_OUT_OF_RANGE{topic-cmvjwkwwwf[0]} NOT_LEADER_FOR_PARTITION{topic-cmvjwkwwwf[1]}
Here's an OffsetForLeaderEpoch response with an incorrect end_offset=-1 (one of the things fixed in #6400):
TRACE 2022-10-05 17:54:38,941 [shard 0] kafka - request_context.h:154 - [172.16.16.21:35407] sending 23:offset_for_leader_epoch response {throttle_time_ms=0 topics={{topic={topic-cmvjwkwwwf} partitions={{error_code={ error_code: none [0] } partition=0 leader_epoch=2 end_offset=-1}, {error_code={ error_code: none [0] } partition=1 leader_epoch=4 end_offset=-1}}}}}
When 22.1.8 is released (currently at rc1), this issue should be fixed.
Hasn't reoccurred since we released 22.1.8.
I've seen this resurface on PR #7191. The failure mode is the same as the one we were seeing previously: timing out on the consumer.
https://buildkite.com/redpanda/redpanda/builds/18360#0184620d-6457-43a9-b2e0-9efdb093140b
I triggered a ci-repeat of this specific test on #7191. Let's see what happens.
Looking into https://buildkite.com/redpanda/redpanda/builds/18360#0184620d-6457-43a9-b2e0-9efdb093140b:
Spotted an interesting span of time where Kafka requests were routed to the wrong broker, resulting in not_leader_for_partition errors.
Logs from node id 1 (docker-rp-24):
TRACE 2022-11-10 17:14:15,059 [shard 0] kafka - request_context.h:160 - [172.16.16.3:43052] sending 23:offset_for_leader_epoch response {throttle_time_ms=0 topics={{topic={topic-rqojdzmgrf} partitions={{error_code={ error_code: not_leader_for_partition [6] } partition=0 leader_epoch=-1 end_offset=-1}}}}}
...
TRACE 2022-11-10 17:15:10,746 [shard 1] kafka - request_context.h:168 - [172.16.16.3:39832] sending 23:offset_for_leader_epoch for {KgoVerifierSeqConsumer-0-140153224577712}, response {throttle_time_ms=0 topics={{topic={topic-rqojdzmgrf} partitions={{error_code={ error_code: not_leader_for_partition [6] } partition=0 leader_epoch=-1 end_offset=-1}}}}}
Looks like node 1 believed that node 2 was the leader:
TRACE 2022-11-10 17:15:08,731 [shard 0] kafka - request_context.h:168 - [172.16.16.3:39696] sending 3:metadata for {KgoVerifierSeqConsumer-0-140153224577712}, response {throttle_time_ms=0 brokers={{node_id=3 host=docker-rp-22 port=9092 rack={nullopt}}, {node_id=2 host=docker-rp-20 port=9092 rack={nullopt}}, {node_id=1 host=docker-rp-24 port=9092 rack={nullopt}}} cluster_id={redpanda.446613b5-c301-49fb-b5de-002f35c2a983} controller_id=3 topics={{error_code={ error_code: none [0] } name={topic-rqojdzmgrf} is_internal=false partitions={{error_code={ error_code: none [0] } partition_index=0 leader_id=3 leader_epoch=3 replica_nodes={{2}, {3}, {1}} isr_nodes={{2}, {3}, {1}} offline_replicas={}}} topic_authorized_operations=0}} cluster_authorized_operations=-2147483648}
After requests for offsets stop (17:15:10,746), the test seems to spin. I would look at the kgo-verifier sequential consumer logs and the logs on node 2.
The kgo-verifier sequential consumer logs stop at 17:15, but node 2 continues to receive requests from the consumer and logs messages like:
TRACE 2022-11-10 17:15:19,514 [shard 0] kafka - request_context.h:168 - [172.16.16.3:46290] sending 1:fetch for {KgoVerifierSeqConsumer-0-140153224577712}, response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=12 topics={}}
The logs on node 2 continue like this until the test exits.
I'm not sure if it's normal or not, but I do see some session_id values being quite large:
TRACE 2022-11-10 17:15:20,076 [shard 1] kafka - request_context.h:168 - [172.16.16.3:46366] sending 1:fetch for {KgoVerifierSeqConsumer-0-140153224577712}, response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1073741839 topics={}}
Just noting some differences when comparing these messages to earlier ones, which have actual values in the topics field:
TRACE 2022-11-10 17:15:14,298 [shard 0] kafka - request_context.h:168 - [172.16.16.3:46442] sending 1:fetch for {KgoVerifierSeqConsumer-0-140153224577712}, response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=20 topics={{name={topic-rqojdzmgrf} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=100000 last_stable_offset=100000 log_start_offset=0 aborted={nullopt} preferred_read_replica=-1 records={{size 52280}}}}}}}
Still investigating.
This looks like #7231; in the seq consumer logs I see we're also hanging at PollFetches():
time="2022-11-10T17:15:14Z" level=debug msg="Sequential read topic-rqojdzmgrf/0 o=99999..."
time="2022-11-10T17:15:14Z" level=debug msg="Consumed 000000.000000000000099999 on p=0 at o=99999"
time="2022-11-10T17:15:14Z" level=debug msg="Read OK (000000.000000000000099999) on p=0 at o=99999"
time="2022-11-10T17:15:14Z" level=info msg="Sequential read complete up to [99999] (validator status {\"name\":\"\",\"valid_reads\":3900000,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
time="2022-11-10T17:15:14Z" level=info msg="Starting sequential read pass"
time="2022-11-10T17:15:14Z" level=info msg="Loading offsets for topic topic-rqojdzmgrf t=-1..."
time="2022-11-10T17:15:14Z" level=debug msg="Partition 0 offset 100000"
time="2022-11-10T17:15:14Z" level=info msg="Sequential read start offsets: [0]"
time="2022-11-10T17:15:14Z" level=info msg="Sequential read end offsets: [100000]"
time="2022-11-10T17:15:14Z" level=info msg="Sequential start offset topic-rqojdzmgrf/0 kgo.Offset{at:0, relative:0, epoch:-1, currentEpoch:0, noReset:false, afterMilli:false}..."
time="2022-11-10T17:15:14Z" level=debug msg="Calling PollFetches (last_read=[0] status {\"name\":\"\",\"valid_reads\":3900000,\"invalid_reads\":0,\"out_of_scope_invalid_reads\":0})"
time="2022-11-10T17:15:18Z" level=info msg="Remote request /last_pass"
That helps rule out the upgrades being at play here.
No instances of this on docker CI on dev in last 30 days.
Vlad saw it in a PR run 15 days ago.
We closed https://github.com/redpanda-data/redpanda/issues/7231 for a similar consumer hang because it refuses to reproduce across hundreds of runs, and there has been a franz-go update in the meantime. The log here also shows a client stuck in PollFetches.
Since we haven't seen this since updating franz-go/kgo-verifier, I'm closing this. The update has slightly better diagnosability (it doesn't leave extra consumers running in the background until they're GC'd), so if this reoccurs we'll hopefully have better logs to work with.
Ran into a new instance of the failure (same timeout) in a CI run for https://github.com/redpanda-data/redpanda/pull/7706. I doubt this has to do with the PR change (S3 client refactor) as the cluster config sets cloud_storage_enabled: false.
https://buildkite.com/redpanda/redpanda/builds/19743#01851006-c605-4b5a-a46e-410474e3c886
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 476, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/upgrade_test.py", line 256, in test_upgrade_with_all_workloads
consumer.wait()
File "/usr/local/lib/python3.10/dist-packages/ducktape/services/service.py", line 261, in wait
if not self.wait_node(node, end - now):
File "/root/tests/rptest/services/kgo_verifier_services.py", line 162, in wait_node
self._redpanda.wait_until(lambda: self._status.active is False or self.
File "/root/tests/rptest/services/redpanda.py", line 1072, in wait_until
wait_until(wrapped,
File "/usr/local/lib/python3.10/dist-packages/ducktape/utils/util.py", line 57, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: None
sev/low because it's believed to be a client bug.
https://buildkite.com/redpanda/redpanda/builds/27356#018793a0-baaf-4f59-94e9-68ca41a86570