redpanda
Franz-go sequential consumer timeout in `FranzGoVerifiableWithSiTest.test_si_without_timeboxed` and `FranzGoVerifiableWithSiTest.test_si_with_timeboxed`
A large number of these test runs fail due to the FranzGoVerifiableSeqConsumer timing out. It's hard to tell if this is a test or a Redpanda issue as there's not much logging from the consumer. The first step would probably be to add logging of committed offsets from the sequential consumer. That way, we can tell if it's making progress and the test simply needs more time.
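A minimal sketch of the kind of check this logging would enable, assuming the committed offsets are sampled from the consumer's output at a fixed interval. The function name, the `window` parameter, and the sampling scheme are hypothetical, not part of the existing test code:

```python
from typing import Sequence


def is_making_progress(committed_offsets: Sequence[int], window: int = 3) -> bool:
    """Return True if the committed offset advanced over the last `window` intervals.

    `committed_offsets` is a hypothetical list of offsets scraped from the
    sequential consumer's log, oldest sample first.
    """
    if len(committed_offsets) < 2:
        # Not enough samples to judge either way.
        return False
    # Compare the newest sample with the one `window` intervals earlier
    # (or the oldest available sample if the history is shorter).
    recent = committed_offsets[-(window + 1):]
    return recent[-1] > recent[0]
```

If this returns True right up until the timeout, the consumer is slow rather than stuck, which would point at the test needing more time.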
The latest failure with this mode is buildkite-job-3039. The test is `FranzGoVerifiableWithSiTest.test_si_without_timeboxed.segment_size=104857600` (link here).
It's possible that the random consumers cause segment evictions on every read from S3, because the test is currently configured with a small cache size. The resulting cache thrashing might be what makes the sequential consumer time out. https://github.com/redpanda-data/redpanda/pull/5915 increases the cache size, so let's wait until that's merged and re-evaluate these failures then.
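To illustrate the thrashing hypothesis, here is a toy simulation of uniformly random segment reads against a small cache. It assumes a plain LRU eviction policy and made-up sizes; it is not a model of Redpanda's actual cache implementation, and `miss_rate` and its parameters are hypothetical:

```python
import random
from collections import OrderedDict


def miss_rate(cache_slots: int, num_segments: int, reads: int, seed: int = 0) -> float:
    """Fraction of uniformly random segment reads that miss an LRU cache."""
    rng = random.Random(seed)
    cache = OrderedDict()  # segment id -> placeholder, ordered oldest to newest use
    misses = 0
    for _ in range(reads):
        seg = rng.randrange(num_segments)
        if seg in cache:
            # Hit: mark the segment as most recently used.
            cache.move_to_end(seg)
        else:
            # Miss: in the real system this would be a download from S3.
            misses += 1
            cache[seg] = True
            if len(cache) > cache_slots:
                # Evict the least recently used segment.
                cache.popitem(last=False)
    return misses / reads


small = miss_rate(cache_slots=2, num_segments=100, reads=10_000)
large = miss_rate(cache_slots=100, num_segments=100, reads=10_000)
print(f"small cache miss rate: {small:.2f}, large cache miss rate: {large:.2f}")
```

With a cache that holds only a couple of segments, almost every random read is a miss followed by an eviction, so any segment the sequential consumer needs is likely to have been evicted before it is read again.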
This may or may not be related, but there's a big rework of the kgo-verifier wrapper services in https://github.com/redpanda-data/redpanda/pull/6059 that should make them more robust (stop running them captively on an SSH channel, ping them for status via HTTP instead, detect if they stop).
There have been no failures for `KgoVerifierWithSiTestLargeSegments` within the last 14 days, and I don't recall seeing any for longer than that. Let's close the issue and re-open if required.