Failure in `TopicRecoveryTest.test_size_based_retention`
Build: https://buildkite.com/redpanda/redpanda/builds/10396#677124b6-8fb4-418b-bd49-d89e63578bd7
FAIL test: TopicRecoveryTest.test_size_based_retention (1/19 runs)
failure at 2022-05-23T07:38:51.539Z: AssertionError('Too much or not enough data restored, expected 10485760 got 10209301')
in job https://buildkite.com/redpanda/redpanda/builds/10396#677124b6-8fb4-418b-bd49-d89e63578bd7
Error:
test_id: rptest.tests.topic_recovery_test.TopicRecoveryTest.test_size_based_retention
--
| status: FAIL
| run time: 51.011 seconds
|
|
| AssertionError('Too much or not enough data restored, expected 10485760 got 10209301')
| Traceback (most recent call last):
| File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 135, in run
| data = self.run_test()
| File "/usr/local/lib/python3.9/dist-packages/ducktape/tests/runner_client.py", line 227, in run_test
| return self.test_context.function(self.test)
| File "/root/tests/rptest/services/cluster.py", line 35, in wrapped
| r = f(self, *args, **kwargs)
| File "/root/tests/rptest/tests/topic_recovery_test.py", line 1293, in test_size_based_retention
| self.do_run(test_case)
| File "/root/tests/rptest/tests/topic_recovery_test.py", line 1180, in do_run
| test_case.validate_cluster(baseline, restored)
| File "/root/tests/rptest/tests/topic_recovery_test.py", line 776, in validate_cluster
| assert is_close_size(size_bytes, self.restored_size_bytes), \
| AssertionError: Too much or not enough data restored, expected 10485760 got 10209301
Another instance https://buildkite.com/redpanda/redpanda/builds/10430#bf475072-ee06-4cf1-b034-d113419c57ce
https://buildkite.com/redpanda/redpanda/builds/10497#d2403fb1-cfed-4737-95b0-b71a5302541b
+1 https://buildkite.com/redpanda/redpanda/builds/10588#0180ff1a-dff2-4299-9999-249a10842283
6/97 runs failed in last 72h -- this one is quite frequent.
Again https://buildkite.com/redpanda/redpanda/builds/10689#01810e8c-1e39-4a42-b13c-0fb654cd2373
seen again https://buildkite.com/redpanda/redpanda/builds/10693#0181137f-16d2-4b57-8b92-1c7b8ff7c5ee/1561-8435
in PR #4940
Again https://buildkite.com/redpanda/redpanda/builds/10751#01811ab4-2845-486a-93d3-8649f66bc5f2
Seen again in https://buildkite.com/redpanda/redpanda/builds/10797#01811ded-c17c-4599-a7b1-10bae0e0238e/1565-8091
Again https://buildkite.com/redpanda/redpanda/builds/10876#01812326-8604-4165-a379-f580b3a8e712
Another https://buildkite.com/redpanda/redpanda/builds/10901#0181274c-bc00-496d-b823-09d36d047edc/1567-8053
one more https://buildkite.com/redpanda/redpanda/builds/10970#018137b8-8fad-403f-b8cc-2e5fce55fb60
https://buildkite.com/redpanda/redpanda/builds/11002#01813cd4-e0b3-4e92-ac3e-681fe2d6e08b
https://buildkite.com/redpanda/redpanda/builds/10998#01813c5e-1bd6-4fb7-aaad-c52d03bdca78
https://buildkite.com/redpanda/redpanda/builds/11327#018165bf-98fa-416c-95c2-3d8470ddb1a0
4/738 failures in last 30 days.
Most recent failure on dev
https://buildkite.com/redpanda/redpanda/builds/11002#01813cd4-e0b3-4e92-ac3e-681fe2d6e08b
v22.2.x https://buildkite.com/redpanda/redpanda/builds/15369#018342c9-26ad-4e53-803b-3d84e126aa8d
@ZeDRoman is helping pick this up. Thanks, Roman.
Reason for failure:
In Shadow Indexing, recovery can restore a total size greater than or equal to `retention.bytes`. Shadow Indexing downloads segments until the sum of their sizes becomes greater than or equal to the `retention.bytes` property (`partition_recovery_manager.cc`, `download_log_with_capped_size`).
Disk log GC starts deleting segments when their total size is greater than `retention.bytes`, so after GC the total size is less than or equal to `retention.bytes` (`disk_log_impl.cc`, `size_based_gc_max_offset`).
When the two work together, the behavior is: SI downloads segments totalling more than `retention.bytes`, then disk log GC removes one segment because the total size is greater than `retention.bytes`.
This is what happens in `TopicRecoveryTest.test_size_based_retention`: SI downloads the segments, disk log GC then automatically deletes a segment, and when the test checks that SI downloaded more than `retention.bytes` it fails, because a segment has already been deleted.
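The interaction can be reproduced with a toy model. This is only a sketch, not the Redpanda code: `download_capped` and `size_based_gc` are hypothetical stand-ins for `download_log_with_capped_size` and `size_based_gc_max_offset`, segments are reduced to plain sizes, the newest-first download order is an assumption, and the 3 MiB segment sizes are made up. The 10 MiB retention value mirrors the expected size in the assertion message.

```python
# Simplified model, not Redpanda code: segments are just sizes in bytes,
# ordered oldest to newest. Newest-first download order is an assumption.

MIB = 1024 * 1024
RETENTION_BYTES = 10 * MIB


def download_capped(segment_sizes, retention_bytes):
    """Take segments newest-first until the running total is >= retention_bytes."""
    picked, total = [], 0
    for size in reversed(segment_sizes):
        if total >= retention_bytes:
            break
        picked.append(size)
        total += size
    picked.reverse()  # restore oldest-to-newest order
    return picked


def size_based_gc(segment_sizes, retention_bytes):
    """Drop the oldest segments while the total is > retention_bytes."""
    kept = list(segment_sizes)
    while kept and sum(kept) > retention_bytes:
        kept.pop(0)
    return kept


uploaded = [3 * MIB] * 5  # hypothetical segments available in the bucket
restored = download_capped(uploaded, RETENTION_BYTES)
after_gc = size_based_gc(restored, RETENTION_BYTES)

print(sum(restored))  # 12582912, above retention.bytes
print(sum(after_gc))  # 9437184, below retention.bytes after GC drops one segment
```

Unless the restored total lands exactly on `retention.bytes`, the download overshoots and GC then undershoots, so a size check that runs after GC sees less data than was restored.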
Solution: Evgeny Lazin proposed adjusting this behavior so that SI downloads strictly less than `retention.bytes`.
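A rough sketch of what the proposed rule could look like, using the same toy model as an assumption (plain segment sizes, newest-first download, made-up 3 MiB segments). `download_strictly_below` is a hypothetical name, not a function in `partition_recovery_manager.cc`, and the real change may well be structured differently.

```python
# Simplified model of the proposal, not Redpanda code: stop before the
# restored total would reach retention_bytes.

MIB = 1024 * 1024
RETENTION_BYTES = 10 * MIB


def download_strictly_below(segment_sizes, retention_bytes):
    """Take segments newest-first while the total stays < retention_bytes."""
    picked, total = [], 0
    for size in reversed(segment_sizes):
        if total + size >= retention_bytes:
            break  # adding this segment would reach or exceed the cap
        picked.append(size)
        total += size
    picked.reverse()  # restore oldest-to-newest order
    return picked


uploaded = [3 * MIB] * 5  # hypothetical segments available in the bucket
restored = download_strictly_below(uploaded, RETENTION_BYTES)

# Size-based GC only triggers when the total is > retention.bytes,
# so in this model it has nothing to delete after recovery.
gc_would_trigger = sum(restored) > RETENTION_BYTES

print(sum(restored))     # 9437184
print(gc_would_trigger)  # False
```

One consequence of this rule in the sketch: if the newest segment alone is at least `retention.bytes`, nothing is picked at all, so the real fix would need to decide how to handle that case.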