
Consumer failed to consume up to offsets in `SIPartitionMovementTest.test_shadow_indexing`

Open rystsov opened this issue 2 years ago • 11 comments

https://buildkite.com/redpanda/redpanda/builds/9985#74311a35-dfef-42fb-a669-5e752e0248a0

Module: rptest.tests.partition_movement_test
Class:  SIPartitionMovementTest
Method: test_shadow_indexing
test_id:    rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing
status:     FAIL
run time:   2 minutes 1.397 seconds

    TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=0): 3271} after waiting 45s.")
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
    data = self.run_test()
  File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
    r = f(self, *args, **kwargs)
  File "/root/tests/rptest/tests/partition_movement_test.py", line 622, in test_shadow_indexing
    self.run_validation(enable_idempotence=False,
  File "/root/tests/rptest/tests/end_to_end.py", line 188, in run_validation
    self.await_consumed_offsets(self.producer.last_acked_offsets,
  File "/root/tests/rptest/tests/end_to_end.py", line 154, in await_consumed_offsets
    wait_until(has_finished_consuming,
  File "/usr/local/lib/python3.9/dist-packages/ducktape/utils/util.py", line 58, in wait_until
    raise TimeoutError(err_msg() if callable(err_msg) else err_msg) from last_exception
ducktape.errors.TimeoutError: Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=0): 3271} after waiting 45s.
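For context, the check that times out is a standard ducktape wait: the test polls until the consumer has reached the producer's last acked offset for every partition. A minimal sketch of that pattern (the helper names here are illustrative, not the actual rptest code):

from ducktape.utils.util import wait_until

def await_consumed_offsets(get_consumer_positions, last_acked_offsets, timeout_sec=45):
    # get_consumer_positions: callable returning {TopicPartition: last consumed offset}
    # last_acked_offsets: {TopicPartition: last offset acked to the producer}
    def has_finished_consuming():
        positions = get_consumer_positions()
        return all(positions.get(tp, -1) >= offset
                   for tp, offset in last_acked_offsets.items())

    wait_until(has_finished_consuming,
               timeout_sec=timeout_sec,
               err_msg=lambda: ("Consumer failed to consume up to offsets "
                                f"{last_acked_offsets} after waiting {timeout_sec}s."))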

rystsov avatar May 11 '22 19:05 rystsov

The logs also have this record:

[WARNING - 2022-05-11 07:08:19,191 - service_registry - stop_all - lineno:55]: Error stopping service <VerifiableConsumer-0-140554673463360: num_nodes: 1, nodes: ['docker-rp-12']>: VerifiableConsumer-0-140554673463360-worker-1: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/ducktape/services/background_thread.py", line 38, in _protected_worker
    self._worker(idx, node)
  File "/root/tests/rptest/services/verifiable_consumer.py", line 266, in _worker
    handler.handle_records_consumed(event, self.logger)
  File "/root/tests/rptest/services/verifiable_consumer.py", line 100, in handle_records_consumed
    raise AssertionError(msg)
AssertionError: Consumed from an unexpected offset (1879, 1881) for partition TopicPartition(topic='topic', partition=0)

It comes from https://github.com/redpanda-data/redpanda/blob/4e7d7dede378906cb7a79f07556fac9242155089/tests/rptest/services/verifiable_consumer.py#L97
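Roughly, that check compares each consumed batch's first offset against the offset the verifier expected to come next. A simplified sketch (the real handle_records_consumed may differ in its details and message format):

def handle_records_consumed(expected_next, event):
    # expected_next: {(topic, partition): offset the verifier expects to see next}
    # event: one 'records_consumed' entry parsed from the consumer's stdout
    for p in event["partitions"]:
        tp = (p["topic"], p["partition"])
        expected = expected_next.get(tp)
        if expected is not None and p["minOffset"] != expected:
            # e.g. the (1879, 1881) pair in the log record quoted above
            raise AssertionError(
                f"Consumed from an unexpected offset for partition {tp}: "
                f"got {p['minOffset']}, expected {expected}")
        expected_next[tp] = p["maxOffset"] + 1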

If we look at the verifiable_producer.stdout log we'll find:

{"timestamp":1652252831017,"name":"producer_send_success","key":null,"value":"1.1942","offset":1878,"topic":"topic","partition":0}
{"timestamp":1652252831043,"name":"producer_send_error","topic":"topic","key":null,"value":"1.1943","exception":"class org.apache.kafka.common.errors.TimeoutException","message":"The request timed out."}
...
{"timestamp":1652252836015,"name":"producer_send_error","topic":"topic","key":null,"value":"1.2454","exception":"class org.apache.kafka.common.errors.NotLeaderOrFollowerException","message":"For requests intended only for the leader, this error indicates that the broker is not the current leader. For requests intended for any replica, this error indicates that the broker is not a replica of the topic partition."}
{"timestamp":1652252836198,"name":"producer_send_success","key":null,"value":"1.2455","offset":1881,"topic":"topic","partition":0}

The record that should have been written at offset 1879 failed with an indecisive error (a timeout). The next successfully written offset is 1881. So it seems the test is wrong: it treats timeouts as definite errors and then complains when it sees offset 1879 instead of 1881.
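To make the distinction concrete, here is how a verifier could classify the producer_send_* events shown above, treating a timeout as indecisive rather than as proof that the record is absent (an illustration only, not the actual rptest producer code):

INDECISIVE_ERRORS = {"class org.apache.kafka.common.errors.TimeoutException"}

def classify_producer_events(events):
    # events: parsed JSON lines from verifiable_producer.stdout
    acked, maybe_written, failed = {}, set(), set()
    for e in events:
        if e["name"] == "producer_send_success":
            acked[e["value"]] = e["offset"]
        elif e["name"] == "producer_send_error":
            if e["exception"] in INDECISIVE_ERRORS:
                # A timed-out write may still be in the log (e.g. at offset 1879),
                # so the consumer must not treat its offset as missing.
                maybe_written.add(e["value"])
            else:
                failed.add(e["value"])
    return acked, maybe_written, failed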

rystsov avatar May 11 '22 19:05 rystsov

Another instance

FAIL test: SIPartitionMovementTest.test_shadow_indexing (1/48 runs)
  failure at 2022-05-20T07:47:03.291Z: TimeoutError("Consumer failed to consume up to offsets {TopicPartition(topic='topic', partition=0): 3863} after waiting 45s.")
      in job https://buildkite.com/redpanda/redpanda/builds/10330#f402f79c-2888-4625-9c50-2db1d2799aa5

dimitriscruz avatar May 20 '22 23:05 dimitriscruz

https://buildkite.com/redpanda/redpanda/builds/11282#018160d7-44b5-46b7-93cd-687324cc3771

BenPope avatar Jun 14 '22 15:06 BenPope

https://buildkite.com/redpanda/redpanda/builds/11889#0181b400-e6c1-41ce-a1f7-34c36116670f

mmaslankaprv avatar Jun 30 '22 12:06 mmaslankaprv

https://buildkite.com/redpanda/redpanda/builds/12164#0181d360-3128-4827-ba34-280a4eb7d567

VladLazar avatar Jul 07 '22 09:07 VladLazar

https://buildkite.com/redpanda/redpanda/builds/12223#0181d751-e498-4171-9779-b8803e81096d/1556-8488

NyaliaLui avatar Jul 07 '22 19:07 NyaliaLui

https://github.com/redpanda-data/redpanda/pull/5238 introduces a new workload & online verifier which isn't subject to such errors; once we switch to it, this issue will be fixed.

rystsov avatar Jul 08 '22 21:07 rystsov

https://buildkite.com/redpanda/redpanda/builds/13051#0182373e-6373-4809-ad96-c9af80ef161f

bharathv avatar Jul 25 '22 23:07 bharathv

https://buildkite.com/redpanda/redpanda/builds/13690#01826ed3-0a49-4212-8cdb-dc348c88ff3d/6-6678

twmb avatar Aug 05 '22 21:08 twmb

https://buildkite.com/redpanda/redpanda/builds/13731#018274bf-d8f2-4ac0-8acb-fb3e6b0f7996

BenPope avatar Aug 08 '22 08:08 BenPope

Similar https://buildkite.com/redpanda/redpanda/builds/13891#0182847d-fc5d-4af2-b6ca-643efbd6f6ac

dotnwat avatar Aug 10 '22 00:08 dotnwat

seen here https://buildkite.com/redpanda/redpanda/builds/14108#01829833-bbd7-4156-b4ee-f24ef39f3f55 on #6013

abhijat avatar Aug 17 '22 05:08 abhijat

Another instance - https://buildkite.com/redpanda/redpanda/builds/14417#0182b82f-a440-4ab7-9da4-267e358d8428

rystsov avatar Aug 20 '22 14:08 rystsov

When I looked into https://buildkite.com/redpanda/redpanda/builds/14432#0182baf4-737e-4b49-9326-d58fe8abb2a5 (technically #6076 but same failure mode as this issue) it looks a lot like the server really is leaving a gap in the kafka offsets. I don't think the consumer is paying any attention to what was/wasn't acked to the producer: it's just consuming the partition, and seeing the kafka offset jump forward around the same point in the log where we did a partition move.

It's not clear to me that this is valid behavior in redpanda.

jcsp avatar Aug 20 '22 14:08 jcsp

Another instance https://buildkite.com/redpanda/redpanda/builds/14540#0182c846-5775-4635-b17f-6b7a9535171f

rystsov avatar Aug 23 '22 04:08 rystsov

it looks a lot like the server really is leaving a gap in the kafka offsets ... It's not clear to me that this is valid behavior in redpanda.

@jcsp what made you think there are gaps? If the verifiable producer treats indecisive errors as definite errors, then the verifiable consumer will think there is a gap while in reality all the offsets are monotonic; when reality collides with those expectations, we get "Consumer failed to consume up to offsets".

rystsov avatar Aug 23 '22 04:08 rystsov

@jcsp what made you think there are gaps? If the verifiable producer treats indecisive errors as definite errors, then the verifiable consumer will think there is a gap while in reality all the offsets are monotonic; when reality collides with those expectations, we get "Consumer failed to consume up to offsets".

What I saw in the consumer stdout was that its fetch responses were skipping offsets (this is in run https://buildkite.com/redpanda/redpanda/builds/14432#0182baf4-737e-4b49-9326-d58fe8abb2a5):

{"timestamp":1660999734132,"name":"records_consumed","count":2,"partitions":[{"topic":"topic","partition":0,"count":2,"minOffset":2975,"maxOffset":2976}]}
{"timestamp":1660999734153,"name":"offsets_committed","offsets":[{"topic":"topic","partition":0,"offset":2977}],"success":true}
{"timestamp":1660999738330,"name":"record_data","key":null,"value":"1.3444","topic":"topic","partition":0,"offset":2983}
{"timestamp":1660999738330,"name":"records_consumed","count":1,"partitions":[{"topic":"topic","partition":0,"count":1,"minOffset":2983,"maxOffset":2983}]}

This isn't code that checks the producer's claims; it's just consuming the topic naively and seeing a jump in the offsets returned by the server.
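The jump can be seen mechanically by scanning the records_consumed events for non-contiguous batches. A small illustration of that stdout analysis (not test code):

import json

def find_offset_gaps(stdout_lines):
    next_expected = {}  # (topic, partition) -> next offset we expect to see
    gaps = []
    for line in stdout_lines:
        event = json.loads(line)
        if event.get("name") != "records_consumed":
            continue
        for p in event["partitions"]:
            tp = (p["topic"], p["partition"])
            expected = next_expected.get(tp)
            if expected is not None and p["minOffset"] > expected:
                # e.g. expected 2977 next, but the batch starts at 2983
                gaps.append((tp, expected, p["minOffset"]))
            next_expected[tp] = p["maxOffset"] + 1
    return gaps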

Looking on the server side, this jump corresponds with a hop between nodes.

The last contiguous read response was from docker-rp-5:

RedpandaService-0-140283699091472/docker-rp-5/redpanda.log:TRACE 2022-08-20 12:48:54,127 [shard 1] kafka - request_context.h:160 - [172.16.16.12:50054] sending 1:fetch response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1073741824 topics={{name={topic} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=2977 last_stable_offset=2977 log_start_offset=0 aborted={nullopt} preferred_read_replica=-1 records={{size 87}}}}}}}

Then on docker-rp-12 a few seconds later we see this:

TRACE 2022-08-20 12:48:58,208 [shard 0] kafka - request_context.h:160 - [172.16.16.12:56206] sending 1:fetch response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1 topics={{name={topic} partitions={{partition_index=0 error_code={ error_code: offset_out_of_range [1] } high_watermark=-1 last_stable_offset=-1 log_start_offset=-1 aborted={nullopt} preferred_read_replica=-1 records={{size 0}}}}}}}
TRACE 2022-08-20 12:48:58,329 [shard 0] kafka - request_context.h:160 - [172.16.16.12:56206] sending 1:fetch response {throttle_time_ms=0 error_code={ error_code: none [0] } session_id=1 topics={{name={topic} partitions={{partition_index=0 error_code={ error_code: none [0] } high_watermark=2984 last_stable_offset=2984 log_start_offset=2983 aborted={nullopt} preferred_read_replica=-1 records={{size 74}}}}}}}

It looks like during the movement the log_start_offset got bumped up to 2983. So it now looks like the server has prefix-truncated the log incorrectly.
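That would explain the jump: when the fetch for offset 2977 hits the moved partition and gets offset_out_of_range, the consumer resets to the partition's reported start offset, now 2983, so offsets 2977-2982 are silently skipped. A rough model of that client-side behavior (the consumer's exact reset policy is an assumption here):

def resolve_fetch_position(position, error_code, log_start_offset):
    # On offset_out_of_range the client reseeks to the earliest offset the
    # broker still serves; if the broker's log_start_offset was (incorrectly)
    # advanced during the move, that reseek skips records.
    if error_code == "offset_out_of_range":
        return max(position, log_start_offset)
    return position

# With the responses quoted above: the consumer was at 2977, and docker-rp-12
# reports log_start_offset 2983 after the move.
assert resolve_fetch_position(2977, "offset_out_of_range", 2983) == 2983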

jcsp avatar Aug 23 '22 08:08 jcsp

Interesting, so we have several issues hidden behind the "Consumer failed to consume up to offsets" umbrella.

rystsov avatar Aug 23 '22 20:08 rystsov

The root cause in this issue is exactly the same as in #6076

mmaslankaprv avatar Aug 25 '22 10:08 mmaslankaprv

@Lazin #6229 also fixes this, right?

jcsp avatar Aug 25 '22 17:08 jcsp