Failure in `ConsumerOffsetsMigrationTest` `test_migrating_consume_offsets` (`failures=True`)
FAIL test: ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=True.cpus=3 (1/51 runs) failure at 2022-07-05T19:22:17.160Z: TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic-kxuttrpvwb', partition=4): 96572, TopicPartition(topic='topic-kxuttrpvwb', partition=1): 33266, TopicPartition(topic='topic-kxuttrpvwb', partition=2): 97585, TopicPartition(topic='topic-kxuttrpvwb', partition=5): 136721, TopicPartition(topic='topic-kxuttrpvwb', partition=3): 79302, TopicPartition(topic='topic-kxuttrpvwb', partition=0): 18175} after waiting 180s.") in job https://buildkite.com/redpanda/redpanda/builds/12122#0181cf69-fc14-4731-85f9-10adc4019485
there are many minutes of this message:
[DEBUG - 2022-07-05 18:12:09,805 - end_to_end - has_finished_consuming - lineno:148]: waiting for partition TopicPartition(topic='topic-kxuttrpvwb', partition=4) offset 96572 to be committed, last committed offset: 24867
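For context, a minimal sketch of what this wait amounts to, assuming kafka-python and hypothetical broker/group names (the real check lives in the test harness's `has_finished_consuming`): poll the group's committed offset for each partition until it reaches the target, or time out.

```python
# Sketch only: approximates the harness's wait using kafka-python.
# Bootstrap address, group id, and the target offsets are hypothetical.
import time
from kafka import KafkaConsumer, TopicPartition

def wait_for_committed(targets, group_id, bootstrap, timeout_s=180):
    """Block until the group's committed offset reaches the target for each partition."""
    consumer = KafkaConsumer(bootstrap_servers=bootstrap, group_id=group_id,
                             enable_auto_commit=False)
    deadline = time.time() + timeout_s
    pending = dict(targets)  # TopicPartition -> target offset
    while pending:
        if time.time() > deadline:
            raise TimeoutError(f"Consumer failed to consume up to offsets {pending}")
        for tp, target in list(pending.items()):
            committed = consumer.committed(tp)  # last committed offset for the group, or None
            if committed is not None and committed >= target:
                del pending[tp]
            else:
                print(f"waiting for partition {tp} offset {target} to be committed, "
                      f"last committed offset: {committed}")
        time.sleep(1)

wait_for_committed({TopicPartition('topic-kxuttrpvwb', 4): 96572},
                   group_id='test-group', bootstrap='localhost:9092')
```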
So this looks like it could genuinely be a stuck consumer.
The failure results from a really rare combination of factors specific to the test. One of the nodes is undergoing a maintenance restart and is unavailable while another node serves a fetch request; this causes verifiable_consumer
(via KafkaConsumer) to reset its consume offset to the earliest position, which in turn makes the consumer worker thread stop on an exception and leaves it unable to collect the further data that is validated at the end of the test.
We need advice from Core on whether preventing KafkaConsumer
from resetting the offset in this situation would be a meaningful fix. Otherwise, VerifiableWorker
should detect this case, turn it into a warning, and skip that test pass.
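One way a worker could detect this case, sketched here under the assumption that the consumer is kafka-python's KafkaConsumer (not necessarily what the verifiable consumer actually uses): configure the offset-reset policy so an out-of-range fetch raises instead of silently rewinding to earliest, and convert the exception into a warning.

```python
# Sketch, not the actual VerifiableConsumer implementation. In kafka-python, any
# auto_offset_reset value other than 'earliest'/'latest' makes an out-of-range
# fetch raise OffsetOutOfRangeError instead of silently resetting the position.
# Assumes the group already has committed offsets; otherwise kafka-python raises
# NoOffsetForPartitionError at startup because there is no reset policy.
import logging
from kafka import KafkaConsumer
from kafka.errors import OffsetOutOfRangeError

log = logging.getLogger("verifiable-worker")

consumer = KafkaConsumer(
    'topic-kxuttrpvwb',
    bootstrap_servers='localhost:9092',   # hypothetical address
    group_id='test-group',                # hypothetical group
    auto_offset_reset='none',             # raise instead of rewinding to earliest
    enable_auto_commit=True,
)

try:
    while True:
        records = consumer.poll(timeout_ms=1000)
        for tp, batch in records.items():
            for record in batch:
                pass  # hand records to the validator
except OffsetOutOfRangeError as e:
    # The proposal above: surface this as a warning (and skip the pass)
    # rather than letting the worker thread die on the exception.
    log.warning("fetch offset out of range during node restart: %s", e)
```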
So something is happening between the fetch response where the server says hwm=44600 and the fetch response where we return "out of range" -- I'm guessing a leadership change took place and the new node somehow ended up with the wrong offset?
That shouldn't happen if we are doing acks=-1 writes -- I would suggest looking at the raft logs for the partition and seeing if the offsets are doing something weird across that leadership change.
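To complement the raft-log check suggested above, a client-side probe (my addition, with hypothetical topic and broker names) could sample each partition's high watermark via kafka-python's list-offsets helper and flag any backwards movement across the leadership change, which acks=-1 writes should never produce.

```python
# Client-side sketch under assumed names: repeatedly sample the high watermark
# per partition and flag it going backwards, e.g. across a leadership change.
import time
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')  # hypothetical address
partitions = [TopicPartition('topic-kxuttrpvwb', p) for p in range(6)]

last_seen = {tp: -1 for tp in partitions}
for _ in range(300):                            # ~5 minutes at one sample per second
    hwms = consumer.end_offsets(partitions)     # list-offsets: latest consumable offsets
    for tp, hwm in hwms.items():
        if hwm < last_seen[tp]:
            print(f"high watermark went backwards on {tp}: {last_seen[tp]} -> {hwm}")
        last_seen[tp] = max(last_seen[tp], hwm)
    time.sleep(1)
```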
May be related: KAFKA-10313
Again https://buildkite.com/redpanda/redpanda/builds/13301#018248b0-0c9f-4a0c-8e84-1e83e9fe3dd3
5788 may help with more context
https://buildkite.com/redpanda/redpanda/builds/13891#0182847d-fc5d-4af2-b6ca-643efbd6f6ac
https://buildkite.com/redpanda/redpanda/builds/13918#01828650-7bd0-4ba1-af35-0b9b4ce7f959 https://buildkite.com/redpanda/redpanda/builds/13918#01828650-7bd3-4c4e-a58c-0d79399d2070
another instance https://buildkite.com/redpanda/redpanda/builds/14468#0182c411-5097-40bb-bca7-253bc829ce87
This remains a rather rare failure. Last 30 days:
FAIL test: ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=True.cpus=1 (2/769 runs)
failure at 2022-10-27T07:24:08.891Z: TimeoutError("Consumer failed to consume up to offsets ... in job https://buildkite.com/redpanda/redpanda/builds/17411#018417da-00d8-435b-95c1-c905f841eec1
failure at 2022-10-05T07:57:17.159Z: TimeoutError("Consumer failed to consume up to offsets... in job https://buildkite.com/redpanda/redpanda/builds/16158#0183a6a9-a9bb-4b75-895f-ae75ef789a28
It's not clear to me that leaving this test disabled is the best move: if it does destabilize further, we won't notice.
This test will go away in https://github.com/redpanda-data/redpanda/pull/7507