Use retry action, don't retry otherwise
This uses an action for retrying steps, which is a bit neater, lets us specify more clearly what's going on, and lets us control the number of retries independently for each step.
I think it's better to specifically use this on the test suites that we believe to be flaky, rather than adding a lot of noise by doing this on every test invocation.
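For reference, a minimal sketch of what a retried test step might look like, assuming the nick-fields/retry action linked further down this thread. The step name, the test command, and the timeout and attempt values are hypothetical, not taken from this patch.

```yaml
# Hypothetical workflow step; values are illustrative, not the ones in this patch.
- name: Test ghcide (retried because the suite is believed to be flaky)
  uses: nick-fields/retry@v2
  with:
    max_attempts: 2              # retry count is set per step
    timeout_minutes: 60          # limit for each individual attempt
    warning_on_retry: true       # the action's default; annotates runs that needed a retry
    command: cabal test ghcide   # hypothetical test invocation
```

Suites that aren't believed to be flaky would keep a plain `run:` step, so a failure there still fails the job on the first attempt.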
Let's fix flaky tests properly. I propose we invert this patch: run all testsuites three times unconditionally and fail if any run fails. Then we will have a list of the most flaky tests.
I've started a patch to fix flaky testsuite items in ghcide.
Let's fix flaky tests properly. I propose we invert this patch to retry all testsuites thrice unconditionally and fail if any run fails.
I'm definitely in favour of fixing flaky tests. I just want our CI to be passing for most people instead of requiring constant restarting as it does right now. Then we can make a ticket for each flaky job to work on it. Having things fail a lot is an incentive to fix the tests for sure, but it also just slows everything down tons.
That said, if you think you can actually fix the flakiness soon then go for it!
The retry action also seems to add a warning to each run that needed retries by default (warning_on_retry). So flaky tests can be identified by going through past runs, without failing pull requests that didn't cause them. I haven't seen this action before, but it looks to me like a nice improvement over the current state.
https://github.com/nick-fields/retry#warning_on_retry
@wz1000 did you get most of the flaky tests? I'd be happy to just remove the retrying on everything if you think it's not needed now.
No, #3423 seems like an issue that causes quite a bit of flakiness throughout the testsuite, but it will require a bit of work to resolve.