redpanda test: use snapshots for detecting segment removal

Cover letter

Previously, the shadow indexing end-to-end test asserted against the current number of segments when checking for segment removal. This approach has the downside that a restart/failure of a redpanda node causes a segment roll, which makes the assertion unreliable in a context with simulated failures. See #5390 for more context.

This PR introduces a new utility for waiting for segment removal, which uses snapshots to determine what was removed. The change deflakes the test against a high number of injected node failures.
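For readers unfamiliar with the idea, here is a minimal sketch of snapshot-based removal detection. It is not the utility added by this PR: the function name, its parameters, and the `list_segments` callable are all hypothetical, and segments are modelled as plain strings. The point it illustrates is that only segments present in the original snapshot count as removed, so segments created by a roll after a node restart cannot skew the result.

```python
import time
from typing import Callable, Set


def wait_for_removal_of_n_segments(
    list_segments: Callable[[], Set[str]],  # hypothetical: returns current segment names
    original_snapshot: Set[str],            # segment names captured before retention kicks in
    n: int,
    timeout_sec: float = 30.0,
    backoff_sec: float = 0.5,
) -> Set[str]:
    """Poll until at least n segments from original_snapshot no longer exist.

    Returns the removed segment names, or raises TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        current = list_segments()
        # Set difference: segments that were in the snapshot but are gone now.
        # Segments created after the snapshot (e.g. by a roll) are ignored.
        removed = original_snapshot - current
        if len(removed) >= n:
            return removed
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {len(removed)} of {n} expected segments were removed"
            )
        time.sleep(backoff_sec)


# Simulated node: three segments exist when the snapshot is taken.
segments = {"0-1-v1.log", "100-1-v1.log", "200-1-v1.log"}
snapshot = set(segments)

# A restart rolls a new segment, then retention deletes the two oldest ones.
segments.add("300-2-v1.log")
segments -= {"0-1-v1.log", "100-1-v1.log"}

removed = wait_for_removal_of_n_segments(
    lambda: set(segments), snapshot, n=2, timeout_sec=1.0
)
```

In this simulation the segment count only drops from three to two, so an assertion on the count alone would see one removal, while the snapshot comparison sees both.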

Fixes #5390

Backport Required

[ ] not a bug fix
[ ] papercut/not impactful enough to backport
[ ] v22.2.x
[x] v22.1.x
[x] v21.11.x

UX changes

none

Release notes

none

Aug 03 '22 15:08 VladLazar

Looks good after cross referencing the suggestions from the linked issue. I added ci-repeat-5 label since this is intended to fix a CI failure.

I'm not sure what that was supposed to do (probably run the CI 5 times), but the bot removed the label. One thing to note is that this fixes a specific failure mode of the test, so we'll have to go through the failures (if any) manually.

Aug 10 '22 12:08 VladLazar

Yes, it runs CI 5 times. The bot removed this tag when it started the 5 jobs (19 out of 20 failed :smiley:).

Aug 10 '22 14:08 LenaAn

I looked at a couple of CI failures and they were not related to this PR, but please look at them to see if we have some related to this PR

Aug 10 '22 15:08 LenaAn

I looked at a couple of CI failures and they were not related to this PR, but please look at them to see if we have some related to this PR

Just went through them. There are two failures of the updated tests:

Failure of test_write_with_node_failures here. This is a different failure mode: https://github.com/redpanda-data/redpanda/issues/4639
Failure of test_write here. This is due to asserting against a fixed number of segments which I fixed in the last force push.

Let's have the CI do a few more runs to see if the failure mode from https://github.com/redpanda-data/redpanda/issues/5390 occurs.

Aug 10 '22 16:08 VladLazar

/ci-repeat 5

Aug 11 '22 11:08 VladLazar

I can't get ci-repeat to work. I've triggered another run manually.

Aug 11 '22 13:08 VladLazar

Changes in force-push: Change segment count assertion to greater than.

Aug 12 '22 11:08 VladLazar

I triggered 5 parallel CI runs. If the failure mode that this PR is trying to fix doesn't occur, I'd say it's fine to merge.

Aug 16 '22 13:08 VladLazar

There was only one failure in the 5 runs: https://github.com/redpanda-data/redpanda/issues/6054. It's new, but I doubt that this PR has anything to do with it.

Aug 16 '22 17:08 VladLazar

test_write_with_node_failures failed on one of the runs, but I think it's a legitimate timeout. The node that failed to remove its segments was stopped three times in a row and didn't get a chance to breach the retention policy and remove the last segment. Increasing the timeout makes this scenario less likely. I'll do that and run the CI again.

Aug 30 '22 13:08 VladLazar

Changes in force push: increased the timeout as mentioned in this comment.

Aug 30 '22 13:08 VladLazar

/ci-repeat 10

Aug 30 '22 13:08 VladLazar

CI is happy now: https://buildkite.com/redpanda/redpanda/builds/14882. @LenaAn could you please re-approve if you're still happy with the change?

Aug 30 '22 16:08 VladLazar

/backport v22.1.x

Aug 31 '22 13:08 VladLazar

Branch name "v22.2.x" not found.

Workflow run logs.

Aug 31 '22 13:08 vbotbuildovich

/backport v22.2.x

Aug 31 '22 13:08 VladLazar

Branch name "v22.2.x" not found.

Workflow run logs.

Aug 31 '22 13:08 vbotbuildovich

/backport v22.2.x

Sep 01 '22 10:09 VladLazar

Branch name "v22.2.x" not found.

Workflow run logs.

Sep 01 '22 10:09 vbotbuildovich
