redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

Failure in `TopicRecoveryTest.test_size_based_retention`

Open ZeDRoman opened this issue 2 years ago • 15 comments

Build: https://buildkite.com/redpanda/redpanda/builds/10396#677124b6-8fb4-418b-bd49-d89e63578bd7

FAIL test: TopicRecoveryTest.test_size_based_retention (1/19 runs)
  failure at 2022-05-23T07:38:51.539Z: AssertionError('Too much or not enough data restored, expected 10485760 got 10209301')
      in job https://buildkite.com/redpanda/redpanda/builds/10396#677124b6-8fb4-418b-bd49-d89e63578bd7

Error:



test_id:    rptest.tests.topic_recovery_test.TopicRecoveryTest.test_size_based_retention
--
  | status:     FAIL
  | run time:   51.011 seconds
  |  
  |  
  | AssertionError('Too much or not enough data restored, expected 10485760 got 10209301')
  | Traceback (most recent call last):
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
  | data = self.run_test()
  | File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
  | return self.test_context.function(self.test)
  | File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
  | r = f(self, *args, **kwargs)
  | File "/root/tests/rptest/tests/topic_recovery_test.py", line 1293, in test_size_based_retention
  | self.do_run(test_case)
  | File "/root/tests/rptest/tests/topic_recovery_test.py", line 1180, in do_run
  | test_case.validate_cluster(baseline, restored)
  | File "/root/tests/rptest/tests/topic_recovery_test.py", line 776, in validate_cluster
  | assert is_close_size(size_bytes, self.restored_size_bytes), \
  | AssertionError: Too much or not enough data restored, expected 10485760 got 10209301


ZeDRoman avatar May 23 '22 15:05 ZeDRoman

Another instance https://buildkite.com/redpanda/redpanda/builds/10430#bf475072-ee06-4cf1-b034-d113419c57ce

ZeDRoman avatar May 24 '22 08:05 ZeDRoman

https://buildkite.com/redpanda/redpanda/builds/10497#d2403fb1-cfed-4737-95b0-b71a5302541b

twmb avatar May 25 '22 02:05 twmb

+1 https://buildkite.com/redpanda/redpanda/builds/10588#0180ff1a-dff2-4299-9999-249a10842283

ZeDRoman avatar May 26 '22 09:05 ZeDRoman

6/97 runs failed in last 72h -- this one is quite frequent.

jcsp avatar May 27 '22 09:05 jcsp

Again https://buildkite.com/redpanda/redpanda/builds/10689#01810e8c-1e39-4a42-b13c-0fb654cd2373

VadimPlh avatar May 29 '22 14:05 VadimPlh

seen again https://buildkite.com/redpanda/redpanda/builds/10693#0181137f-16d2-4b57-8b92-1c7b8ff7c5ee/1561-8435

in PR #4940

andrewhsu avatar May 31 '22 22:05 andrewhsu

Again https://buildkite.com/redpanda/redpanda/builds/10751#01811ab4-2845-486a-93d3-8649f66bc5f2

VadimPlh avatar Jun 01 '22 09:06 VadimPlh

Seen again in https://buildkite.com/redpanda/redpanda/builds/10797#01811ded-c17c-4599-a7b1-10bae0e0238e/1565-8091

NyaliaLui avatar Jun 01 '22 21:06 NyaliaLui

Again https://buildkite.com/redpanda/redpanda/builds/10876#01812326-8604-4165-a379-f580b3a8e712

VadimPlh avatar Jun 02 '22 09:06 VadimPlh

Another https://buildkite.com/redpanda/redpanda/builds/10901#0181274c-bc00-496d-b823-09d36d047edc/1567-8053

NyaliaLui avatar Jun 03 '22 21:06 NyaliaLui

one more https://buildkite.com/redpanda/redpanda/builds/10970#018137b8-8fad-403f-b8cc-2e5fce55fb60

ztlpn avatar Jun 06 '22 13:06 ztlpn

https://buildkite.com/redpanda/redpanda/builds/11002#01813cd4-e0b3-4e92-ac3e-681fe2d6e08b

ztlpn avatar Jun 07 '22 12:06 ztlpn

https://buildkite.com/redpanda/redpanda/builds/10998#01813c5e-1bd6-4fb7-aaad-c52d03bdca78

ajfabbri avatar Jun 07 '22 18:06 ajfabbri

https://buildkite.com/redpanda/redpanda/builds/11327#018165bf-98fa-416c-95c2-3d8470ddb1a0

BenPope avatar Jun 15 '22 20:06 BenPope

4/738 failures in last 30 days.

Most recent failure on dev https://buildkite.com/redpanda/redpanda/builds/11002#01813cd4-e0b3-4e92-ac3e-681fe2d6e08b

jcsp avatar Jul 04 '22 10:07 jcsp

v22.2.x https://buildkite.com/redpanda/redpanda/builds/15369#018342c9-26ad-4e53-803b-3d84e126aa8d

BenPope avatar Oct 11 '22 07:10 BenPope

@ZeDRoman is helping pick this up. Thanks, Roman.

piyushredpanda avatar Oct 11 '22 19:10 piyushredpanda

Reason of Failure:

In Shadow Indexing we have option to recover size more or equal to retention.bytes . So Shadow Indexing would download segments until sum of their sizes become more or equal to retention.bytes property. (partition_recovery_manager.cc download_log_with_capped_size)

In Disk log GC we start to delete segments if their total size more than retention.bytes . So after GC we would have total size less or equal to retention.bytes . (disk_log_impl.cc size_based_gc_max_offset)

So when they are working together we have such behavior: SI downloads segments more than retention.bytes then Disk log GC removes one segment because total size more than retention.bytes .

It turned out in TopicRecoveryTest.test_size_based_retention. SI downloads segments, then segments are automatically deleted by Disk log GC, then we check that SI downloaded more than retention.bytes and test fails (because segment was deleted).

Solution: Evgeny Lazin proposed that we need to adjust this behavior to download strictly less than retention bytes

ZeDRoman avatar Oct 14 '22 16:10 ZeDRoman