
Franz-go sequential consumer timeout in `FranzGoVerifiableWithSiTest.test_si_without_timeboxed` and `FranzGoVerifiableWithSiTest.test_si_with_timeboxed`

Open VladLazar opened this issue 2 years ago • 1 comments

A large number of these test runs fail due to the FranzGoVerifiableSeqConsumer timing out. It's hard to tell if this is a test or a Redpanda issue as there's not much logging from the consumer. The first step would probably be to add logging of committed offsets from the sequential consumer. That way, we can tell if it's making progress and the test simply needs more time.
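As a rough sketch of how logged committed offsets could be used to tell "stalled" apart from "slow but progressing": the helper below takes timestamped offset samples and reports whether the offset has stopped advancing. The function name, the sample format, and the 30 second window are all hypothetical and not part of the existing test or of kgo-verifier.

```python
from typing import List, Tuple


def is_stalled(samples: List[Tuple[float, int]], stall_window_s: float = 30.0) -> bool:
    """Return True if the committed offset has not advanced for stall_window_s.

    samples: (timestamp in seconds, committed offset) pairs, ordered by time.
    Hypothetical helper: the sample format is an assumption, not an existing log format.
    """
    if len(samples) < 2:
        # Not enough data to say anything about progress.
        return False

    last_ts, last_offset = samples[-1]

    # Walk backwards to find when the consumer first reported its latest offset.
    reached_ts = last_ts
    for ts, offset in reversed(samples):
        if offset != last_offset:
            break
        reached_ts = ts

    return last_ts - reached_ts >= stall_window_s
```

If a timed-out run shows offsets that are still advancing at the deadline, the test probably just needs more time; a flat offset for the whole window points at the consumer or at Redpanda.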

The latest failure with this mode is buildkite-job-3039. The test is FranzGoVerifiableWithSiTest.test_si_without_timeboxed.segment_size=104857600 (link here).

VladLazar, Aug 08 '22 14:08

It's possible that the random consumers cause segment evictions on every read from S3, since the test is currently configured with a small cache. This cache thrashing might be what causes the sequential consumer to time out. https://github.com/redpanda-data/redpanda/pull/5915 increases the cache size, so let's wait until that's merged and re-evaluate these failures then.
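To illustrate the suspected effect, here is a toy model using a plain LRU cache. It is not Redpanda's cache implementation, and it assumes the worst case where every read by a random consumer touches a segment that is not already cached. All names and parameters are made up for the sketch.

```python
from collections import OrderedDict


class LruCache:
    """Minimal LRU cache keyed by segment id (toy model only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries: OrderedDict = OrderedDict()

    def access(self, key) -> bool:
        """Return True on a hit. On a miss, insert the key and evict the LRU entry if full."""
        if key in self._entries:
            self._entries.move_to_end(key)
            return True
        self._entries[key] = None
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)
        return False


def sequential_misses(capacity: int, num_segments: int = 20,
                      reads_per_segment: int = 4, interleaved_reads: int = 3) -> int:
    """Count cache misses seen by a sequential reader that shares the cache with other readers."""
    cache = LruCache(capacity)
    misses = 0
    other = 0
    for segment in range(num_segments):
        for _ in range(reads_per_segment):
            if not cache.access(("seq", segment)):
                misses += 1
            # Stand-in for the random consumers: each read touches a new segment.
            for _ in range(interleaved_reads):
                cache.access(("rand", other))
                other += 1
    return misses


print(sequential_misses(capacity=3))  # every sequential read misses
print(sequential_misses(capacity=4))  # one miss per segment
```

In this model, once the cache holds no more entries than the number of reads interleaved by other consumers, the sequential reader's segment is evicted before each of its next reads, so every read goes back to S3. One extra slot is enough to drop that to a single miss per segment.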

VladLazar, Aug 09 '22 15:08

This may or may not be related, but there's a big rework of the kgo-verifier wrapper services in https://github.com/redpanda-data/redpanda/pull/6059 that should make them more robust: stop running them captively on an SSH channel, ping them for status via HTTP instead, and detect if they stop.

jcsp, Aug 16 '22 20:08

There have been no failures for KgoVerifierWithSiTestLargeSegments within the last 14 days, and I don't recall seeing any for some time before that. Let's close the issue and re-open if required.

VladLazar, Nov 03 '22 18:11