util icon indicating copy to clipboard operation
util copied to clipboard

Attempt to fix flaky util-app test in GitHub CI

Open felixbr opened this issue 3 years ago • 5 comments

As you can see in many PRs (e.g. #308) there is often one of the different build matrix cases red. As far as I can see it's always the same flaky test case in AppTest (in util-app).

I tried many things but couldn't reproduce the timeout locally. As an attempt to fight the failures in CI, I'm increasing the timeout for the grace period and also add retries to the test case.

I understand that this is mostly a band aid and not a great solution, so if you don't like it please let me know how we should tackle this instead.

Cheers ~ Felix

felixbr avatar Aug 02 '22 10:08 felixbr

Does just increasing the defaultCloseGracePeriod to 30 seconds like you've done do the trick? It seems like a real bug if we cannot ever get the right exit behavior regardless of the grace period.

bryce-anderson avatar Aug 02 '22 15:08 bryce-anderson

Well, locally it always works, even with a very very low grace period. My suspicion is that a different timeout (e.g. from Await.ready) somewhere is firing the TimeoutException and so increasing the grace period is not really helping to "fix" CI.

That is purely speculation, of course, as I couldn't reproduce it even once locally.

I've seen similar problems at work where tests would basically be starved in CI because sbt tries to parallelize heavily but the machine has very little CPU and so futures in tests would be created but then paused for very long periods, leading to timeouts.

If you want me to, I can remove the retry part but eventually will only retry for a while and then fail the test anyway. So if it was really broken it wouldn't change anything for the outcome.

felixbr avatar Aug 05 '22 09:08 felixbr

If you want me to, I can remove the retry part but eventually will only retry for a while and then fail the test anyway. So if it was really broken it wouldn't change anything for the outcome.

I would prefer to try it. If it's not a problem simply due to "not enough CPU time", which I think you're right that it is, and it's actually flaky for other reasons that is broken and should be fixed and retrying will mask the flakiness/brokenness.

bryce-anderson avatar Aug 11 '22 15:08 bryce-anderson

I've rebased this onto develop and reverted the addition of eventually retries. So it's only an increased timeout and a comment now.

felixbr avatar May 18 '23 15:05 felixbr

I've rebased this onto develop and reverted the addition of eventually retries. So it's only an increased timeout and a comment now.

felixbr avatar May 18 '23 15:05 felixbr