redpanda
test: use snapshots for detecting segment removal
Cover letter
Previously, the shadow indexing end-to-end test asserted against the current number of segments when checking for segment removal. This approach has the downside that a restart/failure of a redpanda node causes a segment roll, which makes the assertion unreliable in a context with simulated failures. See #5390 for more context.
This PR introduces a new utility for waiting for segment removal, which uses snapshots to determine what was removed. The change deflakes the test against a high number of injected node failures.
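To illustrate the idea, here is a minimal sketch of snapshot-based removal detection. It is not the utility added by this PR: the `Snapshot` type, the function names, the `take_snapshot` callback and the segment file names are all hypothetical, and a real test would build the snapshot by listing the segment files on each node.

```python
import time
from typing import Callable, Dict, Set

# Hypothetical type: maps a node name to the set of segment file names on it.
Snapshot = Dict[str, Set[str]]


def removed_segments(original: Snapshot, current: Snapshot) -> Dict[str, Set[str]]:
    """Per node, return the segments from the original snapshot that are gone."""
    return {
        node: segments - current.get(node, set())
        for node, segments in original.items()
    }


def wait_for_segments_removed(
    take_snapshot: Callable[[], Snapshot],  # hypothetical callback
    original: Snapshot,
    n: int,
    timeout_sec: float = 60.0,
    backoff_sec: float = 1.0,
) -> None:
    """Block until every node has removed at least n of its original segments."""
    deadline = time.monotonic() + timeout_sec
    while True:
        removed = removed_segments(original, take_snapshot())
        if all(len(gone) >= n for gone in removed.values()):
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"segments removed so far: {removed}")
        time.sleep(backoff_sec)


# A restart rolled a new segment while retention removed the oldest one.
# The segment count is unchanged, but the snapshot diff shows the removal.
before = {"node-1": {"0-1-v1.log", "10-1-v1.log", "20-1-v1.log"}}
after = {"node-1": {"10-1-v1.log", "20-1-v1.log", "30-2-v1.log"}}
print(removed_segments(before, after))
```

Because the check compares against a fixed set of original segment names, segments created by a roll after a node restart do not affect the result, unlike a comparison of raw segment counts.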
Fixes #5390
Backport Required
- [ ] not a bug fix
- [ ] papercut/not impactful enough to backport
- [ ] v22.2.x
- [x] v22.1.x
- [x] v21.11.x
UX changes
- none
Release notes
- none
Looks good after cross-referencing the suggestions from the linked issue. I added the ci-repeat-5 label since this is intended to fix a CI failure.
I'm not sure what that was supposed to do (probably run the CI 5 times), but the bot removed the label. One thing to note is that this fixes a specific failure mode of the test, so we'll have to go through the failures (if any) manually.
Yes, it runs CI 5 times. The bot removed the tag when it started the 5 jobs (19 out of 20 failed :smiley:).
I looked at a couple of CI failures and they were not related to this PR, but please look at them to see if we have some related to this PR
Just went through them. There are two failures of the updated tests:
- Failure of test_write_node_with_failures here. This is a different failure mode: https://github.com/redpanda-data/redpanda/issues/4639
- Failure of test_write here. This is due to asserting against a fixed number of segments, which I fixed in the last force push.
Let's have the CI do a few more runs to see if the failure mode from https://github.com/redpanda-data/redpanda/issues/5390 occurs.
/ci-repeat 5
I can't get ci-repeat to work. I've triggered another run manually.
Changes in force-push: Change segment count assertion to greater than.
I triggered 5 parallel CI runs. If the failure mode that this PR is trying to fix doesn't occur, I'd say it's fine to merge.
There was only one failure in the 5 runs: https://github.com/redpanda-data/redpanda/issues/6054. It's new, but I doubt that this PR has anything to do with it.
test_write_with_node_failures failed on one of the runs, but I think it's a legitimate timeout. The node that failed to remove its segments was stopped three times in a row and didn't get a chance to breach the retention policy and remove the last segment. Increasing the timeout decreases the likelihood of this scenario happening. I'll do that and run the CI again.
Changes in force push: increased the timeout as mentioned in this comment.
/ci-repeat 10
CI is happy now: https://buildkite.com/redpanda/redpanda/builds/14882. @LenaAn could you please re-approve if you're still happy with the change?
/backport v22.1.x
/backport v22.2.x
/backport v22.2.x