KAFKA-16993: Flaky test shouldInvokeUserDefinedGlobalStateRestoreListener
This PR tests the theory that the test is flaky because restoration completes too early, which means onRestoreSuspended is never called. The fix this PR provides is increasing the number of test records, ensuring the restoration does not complete too quickly.
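For context, the shape of the fix is simply producing more records into the input topic so that restoration takes measurably longer. A minimal sketch, where the method name, topic handling, and producer setup are illustrative rather than the actual test code (only the 1,000-record count comes from this thread):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.IntegerSerializer;

// Illustrative only: bump the record count so restoration cannot finish
// before the second Streams instance triggers onRestoreSuspended.
final int NUM_RECORDS = 1_000;

void produceTestRecords(final String bootstrapServers, final String inputTopic) {
    final Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    try (KafkaProducer<Integer, Integer> producer =
             new KafkaProducer<>(props, new IntegerSerializer(), new IntegerSerializer())) {
        for (int i = 0; i < NUM_RECORDS; i++) {
            producer.send(new ProducerRecord<>(inputTopic, i, i));
        }
        producer.flush();
    }
}
```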
I've kicked off 5 additional builds, and the flaky test in question passed each time.
Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
The test has started failing at a different point; previously it was flaky while waiting for restoration to suspend. This is puzzling, as that part of the code is not under test, and starting a second Kafka Streams application should encounter no issues. This PR has been recently rebased, so I'm wondering if it's at all possible that KAFKA-17432 could be involved. At any rate, I need to investigate further.
Seems this test is full of race conditions? -- But there is no easy way to control it fully...
Not sure if using more input records by itself is the right way to go though? It seems only to reduce the likelihood that the test fails. Should we maybe instead change the test condition (eg, we could count down onRestoreSuspendedLatch in the onRestoreEnd callback as well)?
Or we could decrease MAX_POLL_RECORDS_CONFIG to 1 to slow down restoration even more, and maybe make the RESTORATION_DELAY "dynamic" -- ie, keep it at 500ms, but after we reach a condition, reduce it to zero?
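To make that suggestion concrete, here is a rough sketch of a test listener that counts down onRestoreSuspendedLatch in onRestoreEnd as well, so a restore that finishes before the second instance joins still releases the waiting test. The class name is hypothetical; only onRestoreSuspendedLatch comes from the comment above:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateRestoreListener;

// Hypothetical test listener: the latch is released on suspension OR on a
// completed restore, so the test no longer races against restore speed.
class TrackingRestoreListener implements StateRestoreListener {
    final CountDownLatch onRestoreSuspendedLatch = new CountDownLatch(1);

    @Override
    public void onRestoreStart(final TopicPartition topicPartition, final String storeName,
                               final long startingOffset, final long endingOffset) { }

    @Override
    public void onBatchRestored(final TopicPartition topicPartition, final String storeName,
                                final long batchEndOffset, final long numRestored) { }

    @Override
    public void onRestoreEnd(final TopicPartition topicPartition, final String storeName,
                             final long totalRestored) {
        // If restoration completed before the second instance joined, the
        // suspension callback will never fire; count down here too so the
        // test does not hang waiting for it.
        onRestoreSuspendedLatch.countDown();
    }

    @Override
    public void onRestoreSuspended(final TopicPartition topicPartition, final String storeName,
                                   final long totalRestored) {
        onRestoreSuspendedLatch.countDown();
    }
}
```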
For the new test failure: given that we need to restore 1,000 records now, it seems we might just need more time to transition to RUNNING. Note the test log line:
[shouldInvokeUserDefinedGlobalStateRestoreListeners_dZty6RRJSL5X__RAIPGg-ks2-StreamThread-1] task [0_0] Suspended from RESTORING
> Not sure if using more input records by itself is the right way to go though? It seems only to reduce the likelihood that the test fails. Should we maybe instead change the test condition (eg, we could count down onRestoreSuspendedLatch in the onRestoreEnd callback as well)? Or we could decrease MAX_POLL_RECORDS_CONFIG to 1 to slow down restoration even more, and maybe make the RESTORATION_DELAY "dynamic" -- ie, keep it at 500ms, but after we reach a condition, reduce it to zero?
True, but IMHO, reducing MAX_POLL_RECORDS_CONFIG is the other side of the same coin, so to speak.
Given that the goal of the test is to prove that onRestoreSuspended gets executed, I've removed the race condition around onRestoreSuspended and now check at the end of the test that it's been called at least once.
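A rough sketch of that shape of check, with a hypothetical counting listener and a plain end-of-test assertion (JUnit 5 style; all names here are illustrative, not the actual test code):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.concurrent.atomic.AtomicInteger;

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateRestoreListener;
import org.junit.jupiter.api.Test;

// Hypothetical listener that merely counts suspensions; the test then lets
// restoration run to completion and asserts the count afterwards, instead
// of awaiting a latch at a precise moment mid-restore.
class CountingRestoreListener implements StateRestoreListener {
    final AtomicInteger numSuspendedCalls = new AtomicInteger();

    @Override
    public void onRestoreStart(final TopicPartition topicPartition, final String storeName,
                               final long startingOffset, final long endingOffset) { }

    @Override
    public void onBatchRestored(final TopicPartition topicPartition, final String storeName,
                                final long batchEndOffset, final long numRestored) { }

    @Override
    public void onRestoreEnd(final TopicPartition topicPartition, final String storeName,
                             final long totalRestored) { }

    @Override
    public void onRestoreSuspended(final TopicPartition topicPartition, final String storeName,
                                   final long totalRestored) {
        numSuspendedCalls.incrementAndGet();
    }
}

// Sketch of the end-of-test check:
@Test
void shouldInvokeOnRestoreSuspendedAtLeastOnce() throws Exception {
    final CountingRestoreListener listener = new CountingRestoreListener();
    // ... start both Streams instances, wait for restoration to finish ...
    assertTrue(listener.numSuspendedCalls.get() >= 1,
               "onRestoreSuspended should have been called at least once");
}
```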
> Given that the goal of the test is to prove that onRestoreSuspended gets executed, I've removed the race condition around onRestoreSuspended and now check at the end of the test that it's been called at least once.
Not sure if I see why this would be different? If it's called because the restore did not finish yet, the old and new code would both pass, and if it's not called because the restore was too fast, the old and new code would both fail?
Failures unrelated
> Not sure if I see why this would be different? If it's called because the restore did not finish yet, the old and new code would both pass, and if it's not called because the restore was too fast, the old and new code would both fail?
The previous failures always involved the timing of pausing the restoration. By removing the timing of the pause, I believe the test is more stable; waiting for the end of the restoration is more reliable. As you said before, this test is full of race conditions and there's no way to remove all the timing checks.
Another thought is to remove all the different timers and have just one that waits for onRestoreSuspended, since that is the main focus, and set it to something like 90 seconds. This would simplify things quite a bit.
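A sketch of that simplification, with a single latch and one generous timeout (the 90-second value comes from the comment above; the method name and wiring are illustrative):

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.junit.jupiter.api.Test;

@Test
void shouldInvokeOnRestoreSuspended() throws Exception {
    // Single synchronization point: the test only waits for the suspension
    // callback, with one generous timeout instead of several smaller timers.
    final CountDownLatch onRestoreSuspendedLatch = new CountDownLatch(1);

    // ... wire the latch into the StateRestoreListener's onRestoreSuspended
    //     callback, then start both Kafka Streams instances ...

    assertTrue(onRestoreSuspendedLatch.await(90, TimeUnit.SECONDS),
               "timed out waiting for onRestoreSuspended");
}
```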
- 2nd round (4 test runs in total): no related failures
- 3rd round (6 test runs in total): no related failures
- 4th round (8 test runs): no related failures
Failures unrelated
Merged #16970 into trunk