# CI SLAs
Explore whether we'd like to adopt internal SLAs for our continuous integration (CI) pipeline. If so, produce a first draft and get it ratified.
## Why SLAs/SLOs Make Sense
- Clear expectations: engineering teams know what to expect from CI performance
- Accountability: the Platforms team has measurable targets to optimize against
- Objective metrics: discussions of CI health are grounded in concrete data
## CI Pipeline SLAs & Accountability Framework
### Service Level Agreements
#### Availability & Performance
- Job Success Rate: 95% (excluding legitimate test failures)
- PR Feedback Time: 80% of PR pipelines complete within 1 hour
#### Developer Experience
- Flaky Test Identification: Flaky tests are identified within one day of first flaking
- Flaky Test Remediation: Flaky tests are fixed within two weeks of their first flake
- Clear Failure Attribution: Failures clearly indicate whether the issue is with the code or the infrastructure
- PR Merge Time: Time from PR open to merge is within 7 days (168 hours)
- Merge Queue Health: Failure rate is <5% and queue time is <1 hour
#### Future Metrics
These metrics would be good to include, but we either don't or can't measure them at this time.
- CI System Uptime: 99%
### Accountability Framework
#### Bad Test Resolution

| Severity | Impact | Protocol Team | Platforms Team |
|---|---|---|---|
| P0 | develop blocked | Fix/disable in 2hrs | Can disable in 4hrs |
| P1 | >50% PRs failing | Fix in 8hrs | Implement retries in 4hrs |
| P2 | <50% intermittent | Fix in 2 days | Monitor & diagnose in 1 day |
#### Responsibility Matrix
- Flaky test logic → Protocol owns, Platforms provides diagnostics
- Infrastructure issues → Platforms owns and fixes
- External dependencies → Joint responsibility, Platforms implements fallbacks
### How They Work Together
SLAs define the targets → Accountability ensures they're met
### Key Metrics
- MTTD: Mean Time to Detection (<2 hours)
- MTTM: Mean Time to Mitigation (<8 hours)
- MTTR: Mean Time to Resolution (<1 business day for P0, <14 business days otherwise)
Seeking feedback on the above, especially whether there are any irrelevant metrics we can remove. We'd like to close this out before the end of the week.
As a developer I'm not sure these SLAs really matter to me. I've never had SLAs for CI before and never had reason to ask for or care about them. They don't hurt either so if they're useful for platforms by all means carry on. Just know that we developers are going to complain pretty much whenever there's an issue regardless of whether it's within the SLA or not. We're fun like that. :)
In terms of the actual numbers though:
- 5% expected failure rate feels very high. I'd expect that to translate to multiple failures each day (spread across different developers so not as bad as it sounds).
- An hour is an extraordinarily long time for CI to complete - especially for only 80% of PRs. Pre-merge checks should be under 10 minutes as a base level standard (which we've generally failed to meet). Post-merge can be longer and for some types of tests reducing runtime either isn't an objective or isn't feasible (e.g. sync tests or long running performance tests).
- The idea of disabling checks that are breaking seems wrong. If they're new checks then yes, we should roll them back, but otherwise disabling checks is generally a very bad idea: it can let multiple issues slip through, which makes it very hard to get back to everything passing.
- PR merge time is mostly dependent on when people provide reviews and how much rework/discussion is required. I'm not sure it's a meaningful CI metric.
Seconding @ajsutton's comment. Right now the issue afflicting CI is flaky tests, and we can end the game of whack-a-mole if we either (1) develop a strong process for rapidly fixing flaky tests in the first place (represented by this proposal), or (2) get a little smarter about how we allow new tests into the repo.
Option (2) seems smarter to me. What if we do a diff of which test cases have been introduced to the repo in a given PR, and for any new test cases we run many instances of those tests in parallel to proactively detect flaky tests and reject their admittance to the repo?
I love that you're looking at ways to keep the infrastructure great with metrics. I like the flaky test identification, and clear failure attribution. And I agree with @ajsutton's comment about PR merge time, it isn't really an infra SLA, it's dependent on the PR authors and reviewers. Also, flaky test remediation isn't necessarily a platforms issue.
And I agree that it's dangerous to disable tests. Shouldn't it actually be reverting the PR that broke the tests?
Thank you very much for the feedback all!
I like the idea of "getting smarter about how we allow new tests into the repo". We've previously discussed potentially introducing a staging gate, for example.
I'll have a think about it, produce a lightweight proposal with some ideas and send it around for some feedback! (probably after the Pause, realistically)
Will not implement CI SLAs for now; closing.