redpanda Scale test for recovery from S3

Create a scale test which exercises cluster recovery from S3 with larger amounts of data and partitions.

Note: Based on #5667, so ignore duplicate commits here until that is merged.

Aug 03 '22 18:08 ajfabbri

@ajfabbri looks like a merge conflict

Aug 14 '22 18:08 dotnwat

Force-push:

Latest test code, and fixing merge conflicts.
Remove the scale test suite; work towards getting the test into nightlies for now.
Parallelize file checksum computation (was ~30% of test runtime)
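
For readers unfamiliar with the checksum change, here is a minimal sketch of parallelizing per-file checksums with a thread pool. It is not the code from this PR: the function names, the MD5 choice, the chunk size, and the worker count are all assumptions made for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_files(paths, max_workers=8):
    """Checksum many files concurrently; returns {path: hex digest}.

    max_workers=8 is an arbitrary illustrative default, not a tuned value.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path
        # with its own digest.
        return dict(zip(paths, pool.map(file_md5, paths)))
```

Threads are a reasonable fit here because the work is mostly file I/O and hashing of large buffers; whether that matches how the test actually gathers checksums from the cluster nodes is not stated in this thread.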

Aug 15 '22 03:08 ajfabbri

This should be ready to merge now.

Force-push: rebase on latest dev, and tune scale smaller for nightly runs (at least until we have another schedule with longer period).

Now that segment checksumming has been parallelized (somewhat), the largest contributor to test runtime appears to be waiting for the metadata of all the segment S3 objects to be updated. In the upload bandwidth graph below, you can see the current test takes a bit less than 10 minutes to upload all the segments:

However, if you watch the test debug log while it is running, you can see that the sum of lengths of s3 objects doesn't add up to the expected total bytes for multiple minutes after that.

Aug 17 '22 01:08 ajfabbri

:shrug: Failed in CI. Looking into it.

Aug 19 '22 02:08 ajfabbri

Force-push: Rebase on latest dev, and remove the backoff time tweak. It worked nicely for scale tests but caused failures in the unit tests: the _wait_for_data_in_s3() loop requires 6 consecutive measurements to succeed, and a bigger backoff makes that less likely within the given timeout.
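
To illustrate the interaction between backoff and timeout described above, here is a hedged sketch of a wait loop that needs a run of consecutive successful checks. It is not the actual _wait_for_data_in_s3() implementation; the function name, parameters, and the injectable clock and sleep hooks are hypothetical.

```python
import time


def wait_for_stable(check, required_consecutive=6, timeout_sec=60.0,
                    backoff_sec=1.0, clock=time.monotonic, sleep=time.sleep):
    """Return True once check() has succeeded required_consecutive times
    in a row, or False if that cannot happen before the timeout.

    Hypothetical helper: only the "6 consecutive measurements" figure
    comes from the thread; every other default is an assumption.
    """
    deadline = clock() + timeout_sec
    streak = 0
    while True:
        # A failed measurement resets the streak to zero.
        streak = streak + 1 if check() else 0
        if streak >= required_consecutive:
            return True
        # Give up if the next measurement would land past the deadline.
        if clock() + backoff_sec > deadline:
            return False
        sleep(backoff_sec)
```

In this sketch, even when every check passes, the loop has to sleep at least (required_consecutive - 1) * backoff_sec seconds. With 6 required measurements and a 60-second timeout, a 15-second backoff needs 75 seconds and can never succeed, while a 1-second backoff finishes in 5.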

Aug 19 '22 05:08 ajfabbri

/ci-repeat 5

Aug 19 '22 05:08 ajfabbri

Force push: keep timeout constant as is for topic recovery tests, while still allowing scale test to pass in larger timeouts.

Aug 19 '22 05:08 ajfabbri

/ci-repeat 5

Aug 19 '22 05:08 ajfabbri

Some unrelated failures in the 5x CI run. All release builds passed, but 3/5 debug builds saw unrelated failures:

Build 1/5 saw #5608
Another run saw #4702
This run hit #5886

Aug 19 '22 19:08 ajfabbri

Force-push: Rebase on latest dev and fix conflicts. Address two nits from @abhijat

Aug 23 '22 23:08 ajfabbri

Force-push: address a nit (empty Python method; we should add a check for this to our linter). The CI failure is in the k8s operator test (not related).

Sep 06 '22 21:09 ajfabbri

LGTM. One question: as I understand it, the test uses the same size-based logic as the old recovery test to validate results, but it also uses the verifiable consumer. Is that correct? If so, maybe we should get rid of the size-based validation and keep only the verifiable consumer validation.

Thank you @Lazin .. I plan on doing some more refactoring on the test in future (to avoid subclassing the main recovery test directly).. and I think this is a good idea.

Sep 07 '22 03:09 ajfabbri

k8s operator test is failing again

Sep 12 '22 11:09 Lazin

Force-push: rebase to latest dev in hopes of resolving the k8s CI test issue.

Sep 13 '22 00:09 ajfabbri
