# CI SLAs
Explore whether we'd like to adopt internal SLAs for our continuous integration (CI) pipeline. If so, produce a first draft and get it ratified.
## Why SLAs/SLOs Make Sense
- Clear expectations: engineering teams know what to expect from CI performance
- Accountability: the Platforms team has measurable targets to optimize against
- Objective metrics: discussions of CI health are grounded in concrete data
## CI Pipeline SLAs & Accountability Framework
### Service Level Agreements
#### Availability & Performance
- Job Success Rate: 95% (excluding legitimate test failures)
- PR Feedback Time: 80% of PR pipelines complete within 1 hour
#### Developer Experience
- Flaky Test Identification: Flaky tests are identified within one day of first flaking
- Flaky Test Remediation: Flaky tests are fixed within two weeks of their first flake
- Clear Failure Attribution: Failures clearly indicate whether the issue is with the code or the infrastructure
- PR Merge Time: Time from PR open to merge is within 7 days (168 hours)
- Merge Queue Health: Failure rate is <5% and queue time is <1 hour
#### Future Metrics
These metrics would be good to include, but we either don't or can't measure them at this time.
- CI System Uptime: 99%
### Accountability Framework
#### Bad Test Resolution

| Severity | Impact | Protocol Team | Platforms Team |
|---|---|---|---|
| P0 | develop blocked | Fix/disable in 2hrs | Can disable in 4hrs |
| P1 | >50% PRs failing | Fix in 8hrs | Implement retries in 4hrs |
| P2 | <50% intermittent | Fix in 2 days | Monitor & diagnose in 1 day |
#### Responsibility Matrix
- Flaky test logic → Protocol owns, Platforms provides diagnostics
- Infrastructure issues → Platforms owns and fixes
- External dependencies → Joint responsibility, Platforms implements fallbacks
### How They Work Together
SLAs define the targets → Accountability ensures they're met
### Key Metrics
- MTTD: Mean Time to Detection (<2 hours)
- MTTM: Mean Time to Mitigation (<8 hours)
- MTTR: Mean Time to Resolution (<1 business day for P0, <14 business days otherwise)
Seeking feedback on the above, especially whether there are any irrelevant metrics we can remove. We'd like to close this out before the end of the week.
As a developer I'm not sure these SLAs really matter to me. I've never had SLAs for CI before and never had reason to ask for or care about them. They don't hurt either so if they're useful for platforms by all means carry on. Just know that we developers are going to complain pretty much whenever there's an issue regardless of whether it's within the SLA or not. We're fun like that. :)
In terms of the actual numbers though:
- 5% expected failure rate feels very high. I'd expect that to translate to multiple failures each day (spread across different developers so not as bad as it sounds).
- An hour is an extraordinarily long time for CI to complete - especially for only 80% of PRs. Pre-merge checks should be under 10 minutes as a base level standard (which we've generally failed to meet). Post-merge can be longer and for some types of tests reducing runtime either isn't an objective or isn't feasible (e.g. sync tests or long running performance tests).
- The idea of disabling checks that are breaking seems wrong. If they're new checks then yes, we should roll them back, but otherwise disabling checks is generally a very bad idea: it can let multiple issues slip through, which makes it very hard to get back to everything passing.
- PR merge time is mostly dependent on when people provide reviews and how much rework/discussion is required. I'm not sure it's a meaningful CI metric.
Seconding @ajsutton's comment. Right now the issue afflicting CI is flaky tests, and we can end the game of whack-a-mole if we either (1) develop a strong process for rapidly fixing flaky tests in the first place (represented by this proposal), or (2) get a little smarter about how we allow new tests into the repo.
Option (2) seems smarter to me. What if we do a diff of which test cases have been introduced to the repo in a given PR, and for any new test cases we run many instances of those tests in parallel to proactively detect flaky tests and reject their admittance to the repo?
I love that you're looking at ways to keep the infrastructure great with metrics. I like the flaky test identification, and clear failure attribution. And I agree with @ajsutton's comment about PR merge time, it isn't really an infra SLA, it's dependent on the PR authors and reviewers. Also, flaky test remediation isn't necessarily a platforms issue.
And I agree that it's dangerous to disable tests. Shouldn't it actually be reverting the PR that broke the tests?
Thank you very much for the feedback all!
I like the idea of "getting smarter about how we allow new tests into the repo". We've previously discussed potentially introducing a staging gate, for example.
I'll have a think about it, produce a lightweight proposal with some ideas and send it around for some feedback! (probably after the Pause, realistically)
Will not implement CI SLAs for now; closing.