Scale test for recovery from S3

Open ajfabbri opened this issue 3 years ago • 0 comments

Create a scale test which exercises cluster recovery from S3 with larger amounts of data and partitions.

Note: Based on #5667, so ignore duplicate commits here until that is merged.

ajfabbri avatar Aug 03 '22 18:08 ajfabbri

@ajfabbri looks like a merge conflict

dotnwat avatar Aug 14 '22 18:08 dotnwat

Force-push:

  • Latest test code, and fixing merge conflicts.
  • Remove the scale test suite; work towards getting this into nightlies for now.
  • Parallelize file checksum computation (was ~30% of test runtime)
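
As an illustration of the last bullet, here is a minimal sketch of hashing many files concurrently. It is not the code from this PR: the helper names (`file_md5`, `checksum_files`), the use of MD5, and the thread-pool approach are all assumptions made for the example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def file_md5(path, chunk_size=1 << 20):
    """Hash one file in fixed-size chunks so a large segment is never fully in memory."""
    digest = hashlib.md5()  # MD5 is an assumption; any hashlib algorithm works the same way
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_files(paths, max_workers=8):
    """Return {path: hex digest}, hashing the files concurrently (hypothetical helper)."""
    paths = [str(p) for p in paths]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its own digest.
        return dict(zip(paths, pool.map(file_md5, paths)))
```

Threads are a reasonable fit here because file reads block on I/O and `hashlib` releases the GIL while hashing larger buffers, so a process pool is not strictly needed.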

ajfabbri avatar Aug 15 '22 03:08 ajfabbri

This should be ready to merge now.

Force-push: rebase on latest dev, and tune scale smaller for nightly runs (at least until we have another schedule with longer period).

Now that segment checksumming has been parallelized (somewhat), the longest contributor to test runtime appears to be waiting for all the segment s3 objects' metadata to be updated. In the upload bandwidth graph below, you can see the current test takes a bit less than 10 minutes to upload all the segments:

[image: upload bandwidth graph]

However, if you watch the test debug log while it is running, you can see that the sum of lengths of s3 objects doesn't add up to the expected total bytes for multiple minutes after that.
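
For readers unfamiliar with this kind of check, the comparison described above can be sketched roughly as follows. This is a hedged illustration, not the test's actual code: the function names are hypothetical, the `>=` comparison is an assumption, and `pages` is assumed to be an iterable of dicts shaped like S3 ListObjectsV2 responses (for example, from a boto3 `list_objects_v2` paginator).

```python
def total_object_bytes(pages):
    """Sum the "Size" field over pages shaped like S3 ListObjectsV2 responses."""
    # A page with no matching objects has no "Contents" key, hence the default.
    return sum(obj["Size"] for page in pages for obj in page.get("Contents", []))


def upload_complete(pages, expected_bytes):
    """True once the listed objects account for at least the expected byte count."""
    return total_object_bytes(pages) >= expected_bytes
```

A caller would re-list the bucket and re-run `upload_complete` on each poll, since the listed sizes can lag behind the uploads.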

ajfabbri avatar Aug 17 '22 01:08 ajfabbri

:shrug: Failed in CI. Looking into it.

ajfabbri avatar Aug 19 '22 02:08 ajfabbri

Force-push: Rebase on latest dev, and remove the backoff time tweak. It worked nicely for the scale tests but caused failures in the unit tests: the _wait_for_data_in_s3() loop requires 6 consecutive measurements to succeed, and a bigger backoff makes that less likely within the given timeout.
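
To make the interaction between backoff and timeout concrete, here is a minimal sketch of a "N consecutive successes before a deadline" loop. It is not the body of `_wait_for_data_in_s3()`; the function name `wait_for_consecutive`, its parameters, and the injected `clock`/`sleep` hooks are hypothetical and exist only to make the example testable.

```python
import time


def wait_for_consecutive(check, required=6, timeout_sec=60.0, backoff_sec=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Return True once check() succeeds `required` times in a row before the deadline."""
    deadline = clock() + timeout_sec
    streak = 0
    while clock() < deadline:
        # Any failed measurement resets the streak to zero.
        streak = streak + 1 if check() else 0
        if streak >= required:
            return True
        sleep(backoff_sec)
    return False
```

With illustrative numbers (not taken from the tests): given a check that always succeeds and a 10-second timeout, a 1-second backoff reaches the sixth success at about 5 seconds, while a 3-second backoff fits only four measurements before the deadline and the wait fails.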

ajfabbri avatar Aug 19 '22 05:08 ajfabbri

/ci-repeat 5

ajfabbri avatar Aug 19 '22 05:08 ajfabbri

Force push: keep timeout constant as is for topic recovery tests, while still allowing scale test to pass in larger timeouts.

ajfabbri avatar Aug 19 '22 05:08 ajfabbri

/ci-repeat 5

ajfabbri avatar Aug 19 '22 05:08 ajfabbri

Some unrelated failures in the 5x CI run. All release builds passed, but 3/5 debug builds saw unrelated failures:

  • Build 1/5 saw #5608
  • Another run saw #4702
  • This run hit #5886

ajfabbri avatar Aug 19 '22 19:08 ajfabbri

Force-push: Rebase on latest dev and fix conflicts. Address two nits from @abhijat

ajfabbri avatar Aug 23 '22 23:08 ajfabbri

Force push: address nit (empty Python method; we should add this to our linter). The CI failure is the k8s operator (not related).

ajfabbri avatar Sep 06 '22 21:09 ajfabbri

LGTM. One question: as I understand it, the test uses the same size-based logic as the old recovery test to validate results, but it also uses the verifiable consumer. Is that correct? If so, maybe we should get rid of the size-based validation and keep only the verifiable consumer validation.

Thank you @Lazin. I plan on doing some more refactoring of the test in the future (to avoid subclassing the main recovery test directly), and I think this is a good idea.

ajfabbri avatar Sep 07 '22 03:09 ajfabbri

k8s operator test is failing again

Lazin avatar Sep 12 '22 11:09 Lazin

Force-push: rebase to latest dev in hopes of resolving the k8s CI test issue.

ajfabbri avatar Sep 13 '22 00:09 ajfabbri