redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

Failure in `ConsumerOffsetsMigrationTest` `test_migrating_consume_offsets` (`failures=True`)

Open jcsp opened this issue 2 years ago • 8 comments

FAIL test: ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=True.cpus=3 (1/51 runs) failure at 2022-07-05T19:22:17.160Z: TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic-kxuttrpvwb', partition=4): 96572, TopicPartition(topic='topic-kxuttrpvwb', partition=1): 33266, TopicPartition(topic='topic-kxuttrpvwb', partition=2): 97585, TopicPartition(topic='topic-kxuttrpvwb', partition=5): 136721, TopicPartition(topic='topic-kxuttrpvwb', partition=3): 79302, TopicPartition(topic='topic-kxuttrpvwb', partition=0): 18175} after waiting 180s.") in job https://buildkite.com/redpanda/redpanda/builds/12122#0181cf69-fc14-4731-85f9-10adc4019485

there are many minutes of this message:

[DEBUG - 2022-07-05 18:12:09,805 - end_to_end - has_finished_consuming - lineno:148]: waiting for partition TopicPartition(topic='topic-kxuttrpvwb', partition=4) offset 96572 to be committed, last committed offset: 24867

So this looks like it could be genuinely be a stuck consumer.

jcsp avatar Jul 06 '22 08:07 jcsp

The failure is resulted by a really rare combination of factors specific to the test. One of the nodes undergoing maintenance restart and unavailable while another node serves a fatch request results in verifiable_consumer (via KafkaConsumer) resetting consume offset to the earliest position, which in turn causes consumer worker thread to stop on an exception and unable to collect further data that will be validated at the end of the test.

jLPBR-Cs4BxxLmnyIh1AzB3aXx09jbYqrmQBeZri3OEHnZOtagYIL1BNq7_lKBJaE5ACD4KliZdap3VppOFyeZIJkigpKKc_FE4bApHeLFf4y8YIhpcxpt1LIf5M2SgJ_T-x_Q-YK5MEydhr9EG3ohVC9FRSwNuXKlHxag0dcVGaOKyIZKOc3BYwjMRXqoU8eGaNljt1aokWzM6raQ8grSdvuFtdXzXfLTipPnXw0JGwSyhWzEdhD8TykEf

Need an advice from Core on whether preventing KafkaConsumer from resetting the offset in this situation would be a meaningful idea. Otherwise VerifiableWorker should detect this case and turn it into a warning and skip the test pass.

dlex avatar Jul 27 '22 06:07 dlex

So something is happening between the fetch response where the server says hwm=44600 and the fetch response where we return "out of range" -- I'm guessing a leadership change took place and the new node somehow ended up with the wrong offset?

That shouldn't happen if we are doing acks=-1 writes -- I would suggest looking at the raft logs for the partition and seeing if the offsets are doing something weird across that leadership change.

jcsp avatar Jul 27 '22 14:07 jcsp

May be related: KAFKA-10313

dlex avatar Jul 27 '22 16:07 dlex

Again https://buildkite.com/redpanda/redpanda/builds/13301#018248b0-0c9f-4a0c-8e84-1e83e9fe3dd3

VadimPlh avatar Jul 29 '22 10:07 VadimPlh

5788 may help with more context

dlex avatar Aug 03 '22 20:08 dlex

https://buildkite.com/redpanda/redpanda/builds/13891#0182847d-fc5d-4af2-b6ca-643efbd6f6ac

dotnwat avatar Aug 10 '22 00:08 dotnwat

https://buildkite.com/redpanda/redpanda/builds/13918#01828650-7bd0-4ba1-af35-0b9b4ce7f959 https://buildkite.com/redpanda/redpanda/builds/13918#01828650-7bd3-4c4e-a58c-0d79399d2070

BenPope avatar Aug 11 '22 09:08 BenPope

another instance https://buildkite.com/redpanda/redpanda/builds/14468#0182c411-5097-40bb-bca7-253bc829ce87

rystsov avatar Aug 22 '22 15:08 rystsov

This remains a rather rare failure. last 30 days: OFAIL test: ConsumerOffsetsMigrationTest.test_migrating_consume_offsets.failures=True.cpus=1 (2/769 runs) failure at 2022-10-27T07:24:08.891Z: TimeoutError("Consumer failed to consume up to offsets ... in job https://buildkite.com/redpanda/redpanda/builds/17411#018417da-00d8-435b-95c1-c905f841eec1 failure at 2022-10-05T07:57:17.159Z: TimeoutError("Consumer failed to consume up to offsets... in job https://buildkite.com/redpanda/redpanda/builds/16158#0183a6a9-a9bb-4b75-895f-ae75ef789a28

It's not clear to me that leaving this disabled is the best move, as if it does destabilize further we won't notice.

jcsp avatar Nov 04 '22 14:11 jcsp

This test will go away in https://github.com/redpanda-data/redpanda/pull/7507

jcsp avatar Nov 25 '22 10:11 jcsp