
Intelligent flakiness mitigation

armcknight opened this issue 1 year ago • 2 comments

Description

We've had a persistent problem with flaky tests for a long time.

One process we've tried using is to disable a flaky test with a PR, and open an issue to fix it. This has not eradicated the issues, and doesn't scale that well.

One automation we've attempted in the past is to allow up to a certain number of retries of any test invocation that is known to contain flakes. It was recently removed for our saucelabs runs in https://github.com/getsentry/sentry-cocoa/pull/3660, but it is still in use for our UI test ASAN runs: https://github.com/getsentry/sentry-cocoa/blob/e0904ef6b8b510e9739820aa29379b074b26e237/.github/workflows/ui-tests.yml#L96
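
For concreteness, the retry idea boils down to something like the following (a rough Python sketch of a bounded-retry wrapper, not the actual step from ui-tests.yml; the attempt cap, scheme, and destination are placeholders):

```python
#!/usr/bin/env python3
"""Hypothetical sketch: rerun a test invocation up to MAX_ATTEMPTS times.

Assumes the suite is invoked via `xcodebuild test`; the scheme and
destination below are placeholders, not the values from ui-tests.yml.
"""
import subprocess
import sys

MAX_ATTEMPTS = 3  # assumed cap; the real workflows may use a different number

CMD = [
    "xcodebuild", "test",
    "-scheme", "Sentry",  # placeholder scheme name
    "-destination", "platform=iOS Simulator,name=iPhone 15",  # placeholder
]

for attempt in range(1, MAX_ATTEMPTS + 1):
    print(f"Test attempt {attempt}/{MAX_ATTEMPTS}")
    if subprocess.run(CMD).returncode == 0:
        sys.exit(0)  # a passing attempt ends the loop

# every attempt failed: surface the failure to CI
sys.exit(1)
```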

However, we have unit and UI tests that basically always contain at least one flake. This requires each PR author to check the test run, read what failed, decide whether it's a real failure related to their changes, initiate a retry of the failed jobs through the web UI, and then wait for that round to complete. This process adds hours to each PR.

I propose automating that manual process the same way we've done it elsewhere, by using the retry strategy with https://github.com/getsentry/sentry-cocoa/pull/3881 and https://github.com/getsentry/sentry-cocoa/pull/3883.

Then, we should write some automation that extracts the name of any test that failed in at least one attempt but passed in a later retry (i.e., any test whose invocation actually required more than one run) and reports it in a PR comment. That way we can more easily carry out our current flaky-test process of disabling the test and filing an issue to fix it, instead of having to dive into the GHA logs for each one.
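
Roughly, that extraction step could look like this (a Python sketch; it assumes each retry attempt leaves a JUnit-style XML report at results/attempt-<n>.xml, which is a hypothetical layout, not something our workflows produce today):

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the flake-extraction step.

Assumes each retry attempt produced a JUnit-style XML report at
results/attempt-<n>.xml (e.g. exported by a log formatter); the path
layout and report format are assumptions. A test is reported as flaky
if it ran more than once, failed at least once, and also passed.
"""
import glob
import xml.etree.ElementTree as ET
from collections import defaultdict

# test identifier -> list of per-attempt outcomes (True = failed), in order
outcomes = defaultdict(list)

for report in sorted(glob.glob("results/attempt-*.xml")):
    for case in ET.parse(report).getroot().iter("testcase"):
        name = f'{case.get("classname")}.{case.get("name")}'
        failed = case.find("failure") is not None or case.find("error") is not None
        outcomes[name].append(failed)

flaky = sorted(
    name for name, runs in outcomes.items()
    if len(runs) > 1 and any(runs) and not all(runs)
)

if flaky:
    # Markdown body that a later workflow step could post as the PR comment
    print("### Flaky tests detected in this run")
    for name in flaky:
        print(f"- `{name}`")
```

A later workflow step could then take that output and post it as the PR comment.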

Even that could potentially be automated, since we can open GitHub issues with their API, and skipping a test is just a matter of inserting a line of text into the .xcscheme file. We could also pipe the event to Slack so we're aware of it happening there, too.
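
For illustration, filing the follow-up issue could be as small as this (a Python sketch using the public GitHub REST API; the token source and label are placeholders, and the .xcscheme edit itself isn't shown):

```python
#!/usr/bin/env python3
"""Hypothetical sketch: file a follow-up issue for a flaky test.

Uses the GitHub REST API endpoint for creating issues. The token env
var and label name are placeholders. The companion "skip" step would
add a <Test Identifier="..."/> entry to the <SkippedTests> block of
the relevant .xcscheme (not shown here).
"""
import os
import requests

def file_flaky_test_issue(test_identifier: str) -> None:
    resp = requests.post(
        "https://api.github.com/repos/getsentry/sentry-cocoa/issues",
        headers={
            "Authorization": f'Bearer {os.environ["GITHUB_TOKEN"]}',
            "Accept": "application/vnd.github+json",
        },
        json={
            "title": f"Flaky test: {test_identifier}",
            "body": "Disabled automatically after flaking in CI; please investigate.",
            "labels": ["flaky-test"],  # placeholder label
        },
        timeout=30,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    file_flaky_test_issue("SentryTests.SentryClientTests/testExample")  # example id
```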

armcknight avatar Apr 24 '24 18:04 armcknight

Thanks @armcknight, I agree with a lot here. A few thoughts/opinions:

  • if we run these tests, they should be green; we shouldn't need to ignore failed checks when merging PRs
    • do we need to run these tests on every PR? IMO there should be no reason not to
    • we have flaky tests, and the current process doesn't work well in surfacing/eliminating them
    • as a stopgap (!), it's IMO valid to add auto-retry (and keep manually disabling flaky tests as needed)
  • we definitely need to iterate on that in the short to medium term
    • we're already planning to spend some time on improving CI/flakiness in the next few weeks / in Q2
    • we should automatically surface fails/retries - e.g. in PR comments as suggested above
    • can we also capture fails/retries as errors in https://sentry.sentry.io/issues/?project=5899451? Then we can set up threshold alerts and send them to #team-cocoa etc. (see the sketch after this list)
  • whatever we do, we will iterate
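
For illustration, reporting a flake to Sentry from a CI script could be as simple as this (a Python sketch; the DSN env var, tag names, and one-message-per-flake shape are assumptions, not an agreed design):

```python
#!/usr/bin/env python3
"""Hypothetical sketch: report each CI flake to a Sentry project so
threshold alerts can route to Slack. The DSN env var, tags, and example
values are placeholders."""
import os
import sentry_sdk

sentry_sdk.init(dsn=os.environ["CI_FLAKES_SENTRY_DSN"])  # placeholder env var

def report_flaky_test(test_identifier: str, workflow: str, run_url: str) -> None:
    with sentry_sdk.push_scope() as scope:
        scope.set_tag("ci.workflow", workflow)
        scope.set_tag("flaky.test", test_identifier)
        scope.set_context("ci", {"run_url": run_url})
        sentry_sdk.capture_message(f"Flaky test: {test_identifier}", level="error")

if __name__ == "__main__":
    report_flaky_test(
        "SentryTests.SentryClientTests/testExample",  # example identifier
        "ui-tests",
        "https://github.com/getsentry/sentry-cocoa/actions/runs/123",  # example URL
    )
```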

kahest avatar Apr 26 '24 15:04 kahest

I like the idea of sending these to sentry.sentry.io so we can get Slack alerts; that would be nicer than rebuilding a Slack alerter from scratch in the test scripts.

Let me know how I can help if you have more plans for CI/etc this quarter!

armcknight avatar Jun 11 '24 22:06 armcknight