test: use snapshots for detecting segment removal

Open VladLazar opened this issue 3 years ago • 0 comments

Cover letter

Previously, the shadow indexing end-to-end test asserted against the current number of segments when checking for segment removal. This approach has the downside that a restart/failure of a redpanda node causes a segment roll, which makes the assertion unreliable in a context with simulated failures. See #5390 for more context.

This PR introduces a new utility for waiting for segment removal which uses snapshots to determine what was removed. The change deflakes the test against a high number of injected node failures.
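To illustrate why a snapshot comparison holds up under restarts where a segment count does not, here is a minimal sketch. It is not the utility added by this PR: the `Snapshot` shape, the function names, and the segment file names are all hypothetical and chosen only for the example.

```python
from typing import Dict, Set

# Hypothetical shape: node name -> set of segment file names on that node.
Snapshot = Dict[str, Set[str]]


def removed_segments(original: Snapshot, current: Snapshot) -> Dict[str, Set[str]]:
    """Per node, the segments that were in the original snapshot but are gone now."""
    return {
        node: segments - current.get(node, set())
        for node, segments in original.items()
    }


def segments_removed(original: Snapshot, current: Snapshot, min_removed: int) -> bool:
    """True once every node has dropped at least `min_removed` of the segments
    it had when the snapshot was taken. Segments created afterwards (for
    example by a roll on restart) do not influence the result."""
    removed = removed_segments(original, current)
    return all(len(gone) >= min_removed for gone in removed.values())


# Made-up example data: the node restarted, which rolled a new segment,
# and retention removed the two oldest segments.
before = {"node-1": {"0-1-v1.log", "100-1-v1.log", "200-1-v1.log"}}
after = {"node-1": {"200-1-v1.log", "300-2-v1.log"}}

# The raw count only dropped from 3 to 2, so a check like "at most 1 segment
# left" would fail, while the snapshot diff sees both removals.
assert len(after["node-1"]) == 2
assert segments_removed(before, after, min_removed=2)
```

The design point is that the check is anchored to the set of segments that existed at snapshot time, so extra segments produced by restarts cannot mask a removal.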

Fixes #5390

Backport Required

  • [ ] not a bug fix
  • [ ] papercut/not impactful enough to backport
  • [ ] v22.2.x
  • [x] v22.1.x
  • [x] v21.11.x

UX changes

  • none

Release notes

  • none

VladLazar avatar Aug 03 '22 15:08 VladLazar

Looks good after cross referencing the suggestions from the linked issue. I added ci-repeat-5 label since this is intended to fix a CI failure.

I'm not sure what that was supposed to do (probably run the CI 5 times), but the bot removed the label. One thing to note is that this fixes a specific failure mode of the test, so we'll have to go through the failures (if any) manually.

VladLazar avatar Aug 10 '22 12:08 VladLazar

Yes, it runs CI 5 times. The bot removed the label when it started the 5 jobs (19 out of 20 failed :smiley:).

LenaAn avatar Aug 10 '22 14:08 LenaAn

I looked at a couple of CI failures and they were not related to this PR, but please look at them to see if we have some related to this PR

LenaAn avatar Aug 10 '22 15:08 LenaAn

> I looked at a couple of CI failures and they were not related to this PR, but please look at them to see if we have some related to this PR

Just went through them. There are two failures of the updated tests:

  • Failure of test_write_with_node_failures here. This is a different failure mode: https://github.com/redpanda-data/redpanda/issues/4639
  • Failure of test_write here. This is due to asserting against a fixed number of segments which I fixed in the last force push.

Let's have the CI do a few more runs to see if the failure mode from https://github.com/redpanda-data/redpanda/issues/5390 occurs.

VladLazar avatar Aug 10 '22 16:08 VladLazar

/ci-repeat 5

VladLazar avatar Aug 11 '22 11:08 VladLazar

I can't get ci-repeat to work. I've triggered another run manually.

VladLazar avatar Aug 11 '22 13:08 VladLazar

Changes in force-push: Change segment count assertion to greater than.

VladLazar avatar Aug 12 '22 11:08 VladLazar

I triggered 5 parallel CI runs. If the failure mode that this PR is trying to fix doesn't occur, I'd say it's fine to merge.

VladLazar avatar Aug 16 '22 13:08 VladLazar

There was only one failure in the 5 runs: https://github.com/redpanda-data/redpanda/issues/6054. It's new, but I doubt that this PR has anything to do with it.

VladLazar avatar Aug 16 '22 17:08 VladLazar

test_write_with_node_failures failed on one of the runs, but I think it's a legitimate timeout. The node that failed to remove its segments was stopped three times in a row and didn't get a chance to breach the retention policy and remove the last segment. Increasing the timeout decreases the likelihood of this scenario happening. I'll do that and run the CI again.
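As a rough illustration of the timeout reasoning above, the sketch below polls a condition until a deadline. `poll_until` is a hypothetical stand-in written for this example, not the helper the test actually uses, and the numbers are made up.

```python
import time
from typing import Callable


def poll_until(condition: Callable[[], bool], timeout_sec: float,
               backoff_sec: float = 0.01) -> bool:
    """Check `condition` repeatedly until it holds or `timeout_sec` elapses.
    The condition is always checked at least once."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(backoff_sec)


class FlakyNode:
    """Hypothetical node that only removes its last segment on the Nth check,
    standing in for a node that keeps getting stopped before retention runs."""

    def __init__(self, checks_until_removed: int) -> None:
        self.remaining = checks_until_removed

    def last_segment_removed(self) -> bool:
        self.remaining -= 1
        return self.remaining <= 0


# With no time budget the wait gives up after a single check; with a larger
# budget the same node gets enough chances to finish the removal.
assert poll_until(FlakyNode(3).last_segment_removed, timeout_sec=0) is False
assert poll_until(FlakyNode(3).last_segment_removed, timeout_sec=5) is True
```

A longer timeout does not rule the failure out; it only gives a repeatedly stopped node more opportunities to apply retention before the wait gives up.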

VladLazar avatar Aug 30 '22 13:08 VladLazar

Changes in force push: increased the timeout as mentioned in this comment.

VladLazar avatar Aug 30 '22 13:08 VladLazar

/ci-repeat 10

VladLazar avatar Aug 30 '22 13:08 VladLazar

CI is happy now: https://buildkite.com/redpanda/redpanda/builds/14882. @LenaAn could you please re-approve if you're still happy with the change?

VladLazar avatar Aug 30 '22 16:08 VladLazar

/backport v22.1.x

VladLazar avatar Aug 31 '22 13:08 VladLazar

Branch name "v22.2.x" not found.

Workflow run logs.

vbotbuildovich avatar Aug 31 '22 13:08 vbotbuildovich

/backport v22.2.x

VladLazar avatar Aug 31 '22 13:08 VladLazar

Branch name "v22.2.x" not found.

Workflow run logs.

vbotbuildovich avatar Aug 31 '22 13:08 vbotbuildovich

/backport v22.2.x

VladLazar avatar Sep 01 '22 10:09 VladLazar

Branch name "v22.2.x" not found.

Workflow run logs.

vbotbuildovich avatar Sep 01 '22 10:09 vbotbuildovich