
KAFKA-16993: Flaky test shouldInvokeUserDefinedGlobalStateRestoreListener

bbejeck opened this issue 1 year ago • 6 comments

This PR tests the theory that the test is flaky because restoration sometimes completes before it can be suspended, which means onRestoreSuspended is never called. The fix this PR provides is to increase the number of test records, which ensures the restoration does not complete too quickly.
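For illustration only, a minimal sketch of that shape of fix: producing a larger batch of test records with a plain producer. The topic name and connection details are hypothetical, and the 1000-record count is the figure mentioned later in this thread.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProduceTestRecords {
    public static void main(final String[] args) {
        final Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        // Produce enough records that restoration cannot finish before the
        // test has a chance to suspend it.
        try (final KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            for (int i = 0; i < 1_000; i++) {
                producer.send(new ProducerRecord<>("global-store-topic", "key-" + i, "value-" + i));
            }
        }
    }
}
```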

I've kicked off 5 additional builds, and the flaky test in question passed each time.

Committer Checklist (excluded from commit message)

  • [ ] Verify design and implementation
  • [ ] Verify test coverage and CI build status
  • [ ] Verify documentation (including upgrade notes)

bbejeck commented Aug 22 '24 21:08

The test has started failing at a different point: previously, it was flaky while waiting for restoration to suspend. This is puzzling, as the new failure is in a part of the code not under test, and starting a second Kafka Streams application should encounter no issues. This PR was recently rebased, so I'm wondering if it's at all possible that KAFKA-17432 is involved. At any rate, I need to investigate further.

bbejeck commented Aug 27 '24 18:08

Seems this test is full of race conditions -- but there is no easy way to control them fully...

Not sure if using more input records by itself is the right way to go though? It seems only to reduce the likelihood that the test fails. Should we maybe instead change the test condition (e.g., we could count down onRestoreSuspendedLatch also in the onRestoreEnd callback)?
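For illustration, a minimal sketch of that test-condition change, assuming a latch-based listener like the test uses (class and field names are illustrative, not the actual test's):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateRestoreListener;

// Illustrative listener: the suspend latch is also counted down from
// onRestoreEnd, so a waiter is released even when restoration finishes
// before it can be suspended.
class LatchedRestoreListener implements StateRestoreListener {
    final CountDownLatch onRestoreSuspendedLatch = new CountDownLatch(1);

    @Override
    public void onRestoreStart(final TopicPartition topicPartition, final String storeName,
                               final long startingOffset, final long endingOffset) { }

    @Override
    public void onBatchRestored(final TopicPartition topicPartition, final String storeName,
                                final long batchEndOffset, final long numRestored) { }

    @Override
    public void onRestoreEnd(final TopicPartition topicPartition, final String storeName,
                             final long totalRestored) {
        onRestoreSuspendedLatch.countDown(); // release the waiter even if restore completed
    }

    @Override
    public void onRestoreSuspended(final TopicPartition topicPartition, final String storeName,
                                   final long totalRestored) {
        onRestoreSuspendedLatch.countDown();
    }
}
```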

Or we could decrease MAX_POLL_RECORDS_CONFIG to 1 to slow down restoration even more, and maybe make the RESTORATION_DELAY "dynamic" -- i.e., keep it at 500ms, but once we reach the condition, reduce it to zero?
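A rough sketch of those two knobs, with illustrative names. Restoration of a global store is driven by the global consumer, hence the globalConsumerPrefix for the override:

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

// Restore one record per poll to stretch out restoration.
final Properties streamsProps = new Properties();
streamsProps.put(StreamsConfig.globalConsumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 1);

// Keep the per-batch delay adjustable: start at 500ms, drop it to zero
// once the condition under test (the suspend) has been observed.
final AtomicLong restorationDelayMs = new AtomicLong(500L);
// in onBatchRestored: Thread.sleep(restorationDelayMs.get());
// once the condition is reached: restorationDelayMs.set(0L);
```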

For the new test failure: given that we now need to restore 1000 records, it seems we might just need more time to transition to RUNNING. Note the test log line:

[shouldInvokeUserDefinedGlobalStateRestoreListeners_dZty6RRJSL5X__RAIPGg-ks2-StreamThread-1] task [0_0] Suspended from RESTORING
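If that is the cause, a longer wait for the RUNNING transition might be all that's needed. A sketch, assuming a not-yet-started KafkaStreams instance named streams (the timeout value is illustrative):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.streams.KafkaStreams;

final CountDownLatch runningLatch = new CountDownLatch(1);
streams.setStateListener((newState, oldState) -> {
    if (newState == KafkaStreams.State.RUNNING) {
        runningLatch.countDown();
    }
});
streams.start();
// Allow extra time: 1000 throttled records take a while to restore.
assertTrue(runningLatch.await(2, TimeUnit.MINUTES), "timed out waiting for RUNNING");
```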

mjsax commented Aug 27 '24 23:08

> Not sure if using more input records by itself is the right way to go though? It seems only to reduce the likelihood that the test fails. Should we maybe instead change the test condition (e.g., we could count down onRestoreSuspendedLatch also in the onRestoreEnd callback)? Or we could decrease MAX_POLL_RECORDS_CONFIG to 1 to slow down restoration even more, and maybe make the RESTORATION_DELAY "dynamic" -- i.e., keep it at 500ms, but once we reach the condition, reduce it to zero?

True, but IMHO, reducing MAX_POLL_RECORDS_CONFIG is the other side of the same coin, so to speak.

Given that the goal of the test is to prove that onRestoreSuspended gets executed, I've removed the race condition around onRestoreSuspended and now check at the end of the test that it's been called at least once.
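The reworked check might look roughly like this (a sketch with illustrative names, not the actual test code): count invocations in the callback, then assert once restoration is known to be finished.

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.concurrent.atomic.AtomicInteger;

final AtomicInteger onRestoreSuspendedCalls = new AtomicInteger();
// in the listener:
//   @Override
//   public void onRestoreSuspended(final TopicPartition tp, final String store, final long restored) {
//       onRestoreSuspendedCalls.incrementAndGet();
//   }

// at the end of the test, after restoration has finished:
assertTrue(onRestoreSuspendedCalls.get() >= 1,
    "expected onRestoreSuspended to be invoked at least once");
```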

bbejeck commented Aug 28 '24 20:08

> Given that the goal of the test is to prove that onRestoreSuspended gets executed, I've removed the race condition around onRestoreSuspended and now check at the end of the test that it's been called at least once.

Not sure I see why this would be different? If it's called because restore did not finish yet, the old and new code would both pass; and if it's not called because restore was too fast, the old and new code would both fail?

mjsax commented Aug 28 '24 22:08

Failures unrelated

bbejeck commented Aug 29 '24 13:08

> Not sure I see why this would be different? If it's called because restore did not finish yet, the old and new code would both pass; and if it's not called because restore was too fast, the old and new code would both fail?

The previous failures always involved the timing of pausing the restoration. By removing the timing dependency on the pause, I believe the test is more stable; waiting for the end of the restoration is more reliable. As you said before, this test is full of race conditions, and there's no way to remove all the timing checks.

bbejeck commented Aug 29 '24 14:08

Another thought is to remove all the different timers and have just one that waits for onRestoreSuspended, since that is the main focus, and set it to something like 90 seconds. This would simplify things quite a bit.
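A sketch of that single-timer shape (names illustrative, the 90-second value as proposed above):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final CountDownLatch suspendedLatch = new CountDownLatch(1);
// in onRestoreSuspended: suspendedLatch.countDown();
assertTrue(suspendedLatch.await(90, TimeUnit.SECONDS),
    "timed out waiting for onRestoreSuspended");
```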

bbejeck commented Aug 29 '24 16:08

2nd round (4 test runs in total): no related failures
3rd round (6 test runs in total): no related failures
4th round (8 test runs): no related failures

bbejeck commented Aug 29 '24 16:08

Failures unrelated

bbejeck commented Sep 04 '24 12:09

Merged #16970 into trunk

bbejeck commented Sep 04 '24 12:09