Consumer failed to consume up to offsets in `SIPartitionMovementTest.test_shadow_indexing`
https://buildkite.com/redpanda/redpanda/builds/9985#74311a35-dfef-42fb-a669-5e752e0248a0
Module: rptest.tests.partition_movement_test
Class: SIPartitionMovementTest
Method: test_shadow_indexing
test_id: rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing
status: FAIL
run time: 2 minutes 1.397 seconds
TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=0): 3271} after waiting 45s.")
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
data = self.run_test()
File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
return self.test_context.function(self.test)
File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
r = f(self, *args, **kwargs)
File "/root/tests/rptest/tests/partition_movement_test.py", line 622, in test_shadow_indexing
self.run_validation(enable_idempotence=False,
File "/root/tests/rptest/tests/end_to_end.py", line 188, in run_validation
self.await_consumed_offsets(self.producer.last_acked_offsets,
File "/root/tests/rptest/tests/end_to_end.py", line 154, in await_consumed_offsets
wait_until(has_finished_consuming,
File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until
raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=0): 3271} after waiting 45s.
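For context, the wait that times out is conceptually a loop like the following (a minimal sketch, not the actual end_to_end.py code; consumer_positions is an illustrative callable returning the consumer's current position per partition):

from ducktape.utils.util import wait_until

def await_consumed_offsets(consumer_positions, last_acked_offsets, timeout_sec=45):
    """Block until the consumer has reached every offset the producer saw acked."""
    def has_finished_consuming():
        # For each partition the consumer must have consumed at least up to the
        # highest offset the producer believes was successfully written.
        return all(consumer_positions().get(tp, -1) >= offset
                   for tp, offset in last_acked_offsets.items())

    wait_until(has_finished_consuming,
               timeout_sec=timeout_sec,
               err_msg=lambda: f"Consumer failed to consume up to offsets "
                               f"{last_acked_offsets} after waiting {timeout_sec}s.")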
The logs also have this record:
[WARNING - 2022-05-11 07:08:19,191 - service_registry - stop_all - lineno:55]: Error stopping service <VerifiableConsumer-0-140554673463360: num_nodes: 1, nodes: ['docker-rp-12']>: VerifiableConsumer-0-140554673463360-worker-1: Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/ducktape/services/background_thread.py", line 38, in _protected_worker
self._worker(idx, node)
File "/root/tests/rptest/services/verifiable_consumer.py", line 266, in _worker
handler.handle_records_consumed(event, self.logger)
File "/root/tests/rptest/services/verifiable_consumer.py", line 100, in handle_records_consumed
raise AssertionError(msg)
AssertionError: Consumed from an unexpected offset (1879, 1881) for partition TopicPartition(topic='topic', partition=0)
It comes from https://github.com/redpanda-data/redpanda/blob/4e7d7dede378906cb7a79f07556fac9242155089/tests/rptest/services/verifiable_consumer.py#L97
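The check that fires there is essentially a per-partition offset-continuity assertion. Conceptually it looks like this (a simplified sketch, not the exact verifiable_consumer.py code; the names are illustrative):

def check_consumed_batch(next_expected, batch_min_offset, batch_max_offset, tp):
    """Simplified continuity check: the first offset of every consumed batch
    must match the offset we expect next for that partition."""
    expected = next_expected[tp]
    if batch_min_offset != expected:
        # This is the path behind
        # "Consumed from an unexpected offset (1879, 1881)".
        raise AssertionError(
            f"Consumed from an unexpected offset "
            f"({expected}, {batch_min_offset}) for partition {tp}")
    # Advance the expectation past the batch we just consumed.
    next_expected[tp] = batch_max_offset + 1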
If we look at the verifiable_producer.stdout log, we'll find:
{"timestamp":1652252831017,"name":"producer_send_success","key":null,"value":"1.1942","offset":1878,"topic":"topic","partition":0}
{"timestamp":1652252831043,"name":"producer_send_error","topic":"topic","key":null,"value":"1.1943","exception":"class org.apache.kafka.common.errors.TimeoutException","message":"The request timed out."}
...
{"timestamp":1652252836015,"name":"producer_send_error","topic":"topic","key":null,"value":"1.2454","exception":"class org.apache.kafka.common.errors.NotLeaderOrFollowerException","message":"For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition."}
{"timestamp":1652252836198,"name":"producer_send_success","key":null,"value":"1.2455","offset":1881,"topic":"topic","partition":0}
The record that should have been written at offset 1879 failed with an indecisive error (a timeout). The next successfully written offset is 1881. So it seems the test is wrong: it treats the timeout as a definite failure and then complains about the mismatch between offsets 1879 and 1881.
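To make the distinction concrete, here is a hedged sketch of the ack bookkeeping a producer-side verifier could do (the names are illustrative, not the real verifiable_producer code): a timed-out send has to land in a "maybe written" bucket, because the broker may still have committed the record.

# Illustrative ack bookkeeping for a producer-side verifier.
acked_offsets = {}    # partition -> highest offset definitely written
maybe_written = {}    # partition -> values whose fate is unknown

def on_send_result(partition, value, offset=None, exception=None):
    if exception is None:
        # Decisive success: the broker acked this offset.
        acked_offsets[partition] = max(acked_offsets.get(partition, -1), offset)
    elif exception == "TimeoutException":
        # Indecisive: the request timed out on the client, but the broker may
        # still have written the record. Counting this as a definite failure
        # is what makes the consumer expect 1879 and then choke on 1881.
        maybe_written.setdefault(partition, []).append(value)
    else:
        # Errors such as NotLeaderOrFollowerException are rejections; the
        # record was not written by that broker and will be retried.
        pass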
Another instance
FAIL test: SIPartitionMovementTest.test_shadow_indexing (1/48 runs)
failure at 2022-05-20T07:47:03.291Z: TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=0): 3863} after waiting 45s.")
in job https://buildkite.com/redpanda/redpanda/builds/10330#f402f79c-2888-4625-9c50-2db1d2799aa5
https://buildkite.com/redpanda/redpanda/builds/11282#018160d7-44b5-46b7-93cd-687324cc3771
https://buildkite.com/redpanda/redpanda/builds/11889#0181b400-e6c1-41ce-a1f7-34c36116670f
https://buildkite.com/redpanda/redpanda/builds/12164#0181d360-3128-4827-ba34-280a4eb7d567
https://buildkite.com/redpanda/redpanda/builds/12223#0181d751-e498-4171-9779-b8803e81096d/1556-8488
https://github.com/redpanda-data/redpanda/pull/5238 introduces a new workload and online verifier that isn't subject to such errors; once we switch to it, this issue will be fixed.
https://buildkite.com/redpanda/redpanda/builds/13051#0182373e-6373-4809-ad96-c9af80ef161f
https://buildkite.com/redpanda/redpanda/builds/13690#01826ed3-0a49-4212-8cdb-dc348c88ff3d/6-6678
https://buildkite.com/redpanda/redpanda/builds/13731#018274bf-d8f2-4ac0-8acb-fb3e6b0f7996
Similar https://buildkite.com/redpanda/redpanda/builds/13891#0182847d-fc5d-4af2-b6ca-643efbd6f6ac
seen here https://buildkite.com/redpanda/redpanda/builds/14108#01829833-bbd7-4156-b4ee-f24ef39f3f55 on #6013
Another instance - https://buildkite.com/redpanda/redpanda/builds/14417#0182b82f-a440-4ab7-9da4-267e358d8428
When I looked into https://buildkite.com/redpanda/redpanda/builds/14432#0182baf4-737e-4b49-9326-d58fe8abb2a5 (technically #6076, but same failure mode as this issue), it looks a lot like the server really is leaving a gap in the Kafka offsets. I don't think the consumer is paying any attention to what was or wasn't acked to the producer: it's just consuming the partition and seeing the Kafka offset jump forward around the same point in the log where we did a partition move.
It's not clear to me that this is valid behavior in redpanda.
Another instance https://buildkite.com/redpanda/redpanda/builds/14540#0182c846-5775-4635-b17f-6b7a9535171f
it looks a lot like the server really is leaving a gap in the kafka offsets ... It's not clear to me that this is valid behavior in redpanda.
@jcsp what made you think there are gaps? If the verifiable producer treats indecisive errors as definite errors, then the verifiable consumer will think there is a gap while in reality all the offsets are monotonic, and when reality collides with expectations we get "Consumer failed to consume up to offsets".
What I saw in the consumer stdout was that its fetch responses were skipping offsets (this is in run https://buildkite.com/redpanda/redpanda/builds/14432#0182baf4-737e-4b49-9326-d58fe8abb2a5):
{"timestamp":1660999734132,"name":"records_consumed","count":2,"partitions":[{"topic":"topic","partition":0,"count":2,"minOffset":2975,"maxOffset":2976}]}
{"timestamp":1660999734153,"name":"offsets_committed","offsets":[{"topic":"topic","partition":0,"offset":2977}],"success":true}
{"timestamp":1660999738330,"name":"record_data","key":null,"value":"1.3444","topic":"topic","partition":0,"offset":2983}
{"timestamp":1660999738330,"name":"records_consumed","count":1,"partitions":[{"topic":"topic","partition":0,"count":1,"minOffset":2983,"maxOffset":2983}]}
This isn't code that checks the producer's claims; it's just consuming the topic naively and seeing a jump in the offsets returned by the server.
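For illustration, the same kind of jump would be caught by a plain consumer doing nothing but a contiguity check (a hedged sketch assuming kafka-python and a placeholder bootstrap address):

from kafka import KafkaConsumer, TopicPartition

# Placeholder broker address and topic; the point is only the contiguity check.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         enable_auto_commit=False)
tp = TopicPartition("topic", 0)
consumer.assign([tp])

expected = None
for record in consumer:
    if expected is not None and record.offset != expected:
        # Same symptom as the run above: 2977 was expected, 2983 arrived.
        print(f"offset jump on {tp}: expected {expected}, got {record.offset}")
    expected = record.offset + 1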
Looking on the server side, this jump corresponds with a hop between nodes.
The last contiguous read response was from docker-rp-5:
RedpandaService-0-140283699091472/docker-rp-5/redpanda.log:TRACE 2022-08-20 12:48:54,127 [shard 1] kafka - request_context.h:160 - [172.16.16.12:50054] sending 1:fetch response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1073741824 topics={{name={topic} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=2977 last_stable_offset=2977 log_start_offset=0 aborted={nullopt} preferred_read_replica=-1 records={{size 87}}}}}}}
Then on docker-rp-12 a few seconds later we see this:
TRACE 2022-08-20 12:48:58,208 [shard 0] kafka - request_context.h:160 - [172.16.16.12:56206] sending 1:fetch response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1 topics={{name={topic} partitions={{partition_index=0 error_code={ error_code: offset_out_of_range [1] } high_watermark=-1 last_stable_offset=-1 log_start_offset=-1 aborted={nullopt} preferred_read_replica=-1 records={{size 0}}}}}}}
TRACE 2022-08-20 12:48:58,329 [shard 0] kafka - request_context.h:160 - [172.16.16.12:56206] sending 1:fetch response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1 topics={{name={topic} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=2984 last_stable_offset=2984 log_start_offset=2983 aborted={nullopt} preferred_read_replica=-1 records={{size 74}}}}}}}
It looks like during the movement the log_start_offset got bumped up to 2983. So now this looks like the server has prefix-truncated the log incorrectly.
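If that's the case, the truncation should also be visible from any client by asking for the partition's beginning offset (a hedged diagnostic sketch using kafka-python; the bootstrap address is a placeholder):

from kafka import KafkaConsumer, TopicPartition

# Query the offsets a Kafka client is shown for the partition.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
tp = TopicPartition("topic", 0)

start = consumer.beginning_offsets([tp])[tp]  # reflects log_start_offset
end = consumer.end_offsets([tp])[tp]          # reflects the high watermark
print(f"log start offset: {start}, high watermark: {end}")
# In the failing run this would show the start jumping to 2983 even though
# offsets 2977..2982 were never delivered to the consumer.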
Interesting, so we have several issues hidden behind the "Consumer failed to consume up to offsets" umbrella.
The root cause in this issue is exactly the same as in #6076
@Lazin #6229 also fixes this, right?