optimism

CI SLAs

scharissis opened this issue 6 months ago

Explore whether we'd like to adopt internal SLAs for our continuous integration pipeline. If we would like to do so, produce a first draft and get it ratified.

scharissis avatar Jun 16 '25 07:06 scharissis

Why SLAs/SLOs Make Sense

  1. Clear expectations: engineering teams know what to expect from CI performance
  2. Accountability: the Platforms team has measurable targets to optimize against
  3. Objective metrics: we can discuss CI health in terms of concrete numbers

scharissis avatar Jun 16 '25 07:06 scharissis

CI Pipeline SLAs & Accountability Framework

Service Level Agreements

Availability & Performance

  • Job Success Rate: 95% (excluding legitimate test failures)
  • PR Feedback Time: 80% of PR pipelines complete within 1 hour
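As a rough illustration of how these two numbers could be tracked, here is a minimal sketch that pulls recent workflow runs and computes the job success rate and the 80th-percentile pipeline duration. It assumes the pipeline runs on GitHub Actions; the repo slug and token below are placeholders, and the same idea applies to any CI provider that exposes a runs API.

```python
import requests
from datetime import datetime, timedelta, timezone

# Placeholders -- the repo slug and token are assumptions, not real config.
REPO = "ethereum-optimism/optimism"
TOKEN = "ghp_..."

def fetch_runs(days: int = 7):
    """Fetch completed workflow runs from the last `days` days (first page only, for brevity)."""
    since = (datetime.now(timezone.utc) - timedelta(days=days)).strftime("%Y-%m-%d")
    resp = requests.get(
        f"https://api.github.com/repos/{REPO}/actions/runs",
        headers={"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json"},
        params={"status": "completed", "created": f">={since}", "per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["workflow_runs"]

def summarize(runs):
    # Only count runs that actually succeeded or failed; skip cancelled/skipped runs.
    finished = [r for r in runs if r["conclusion"] in ("success", "failure")]
    success_rate = sum(r["conclusion"] == "success" for r in finished) / max(len(finished), 1)
    durations = sorted(
        (datetime.fromisoformat(r["updated_at"].replace("Z", "+00:00"))
         - datetime.fromisoformat(r["run_started_at"].replace("Z", "+00:00"))).total_seconds() / 60
        for r in finished if r.get("run_started_at")
    )
    p80 = durations[int(0.8 * (len(durations) - 1))] if durations else float("nan")
    return success_rate, p80

rate, p80 = summarize(fetch_runs())
print(f"job success rate: {rate:.1%}  |  p80 pipeline duration: {p80:.0f} min")
```

This treats each workflow run's duration as a proxy for PR feedback time; a stricter measurement would group runs by pull request and take the time from push to last check completing.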

Developer Experience

  • Flaky Test Identification: Flaky tests are identified within one day of first flaking (see the detection sketch after this list)
  • Flaky Test Remediation: Flaky tests are corrected within two weeks of the first flake
  • Clear Failure Attribution: Failures clearly indicate whether the issue is with the code or infrastructure
  • PR Merge Time: Time from PR open to merge is within 7 days (168 hours)
  • Merge Queue Health: Merge queue failure rate is <5%, queue time is <1h
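For the flaky-test items above, a minimal detection sketch, assuming CI archives per-test results keyed by commit SHA (e.g. parsed from `go test -json` output): a test that both passes and fails on the same commit is flagged as flaky. The test names in the example are hypothetical.

```python
from collections import defaultdict

def find_flaky(results):
    """Flag tests that both passed and failed on the same commit.

    `results` is an iterable of (commit_sha, test_name, passed) tuples,
    e.g. extracted from archived `go test -json` output.
    """
    outcomes = defaultdict(set)
    for sha, test, passed in results:
        outcomes[(sha, test)].add(bool(passed))
    return sorted({test for (_, test), seen in outcomes.items() if len(seen) == 2})

# Illustrative records only -- the test names are hypothetical.
results = [
    ("abc123", "TestDerivationPipeline", True),
    ("abc123", "TestDerivationPipeline", False),  # same commit, mixed outcome => flaky
    ("abc123", "TestBatchSubmitter", True),
]
print(find_flaky(results))  # ['TestDerivationPipeline']
```

Running this daily over the previous day's results would be one way to meet the one-day identification target.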

Future Metrics

These metrics would be good to include, but we don't or can't measure them at this time.

  • CI System Uptime: 99%

Accountability Framework

Bad Test Resolution

Severity | Impact | Protocol Team | Platforms Team
P0 | develop blocked | Fix/disable in 2hrs | Can disable in 4hrs
P1 | >50% PRs failing | Fix in 8hrs | Implement retries in 4hrs
P2 | <50% intermittent | Fix in 2 days | Monitor & diagnose in 1 day

Responsibility Matrix

  • Flaky test logic → Protocol owns, Platforms provides diagnostics
  • Infrastructure issues → Platforms owns and fixes
  • External dependencies → Joint responsibility, Platforms implements fallbacks

How They Work Together

SLAs define the targets; the accountability framework ensures they're met.

Key Metrics

  • MTTD: Mean Time to Detection (<2 hours)
  • MTTM: Mean Time to Mitigation (<8 hours)
  • MTTR: Mean Time to Resolution (<1 business day for P0, <14 business days otherwise)
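These three can be computed directly from incident timestamps. A minimal sketch, assuming a simple incident log; the field names and entries are illustrative only.

```python
from datetime import datetime
from statistics import mean

# Illustrative incident log: when a CI breakage started, was detected,
# was mitigated (e.g. retried or quarantined), and was fully resolved.
incidents = [
    {"started": "2025-06-16T07:00", "detected": "2025-06-16T08:30",
     "mitigated": "2025-06-16T12:00", "resolved": "2025-06-17T09:00"},
]

def hours_between(a, b):
    return (datetime.fromisoformat(b) - datetime.fromisoformat(a)).total_seconds() / 3600

mttd = mean(hours_between(i["started"], i["detected"]) for i in incidents)
mttm = mean(hours_between(i["started"], i["mitigated"]) for i in incidents)
mttr = mean(hours_between(i["started"], i["resolved"]) for i in incidents)
print(f"MTTD {mttd:.1f}h, MTTM {mttm:.1f}h, MTTR {mttr:.1f}h")
```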

scharissis avatar Jun 16 '25 07:06 scharissis

Seeking feedback on the above, especially on any irrelevant metrics we can remove. We'd like to close this out before the end of the week.

scharissis avatar Jun 24 '25 01:06 scharissis

As a developer I'm not sure these SLAs really matter to me. I've never had SLAs for CI before and never had reason to ask for or care about them. They don't hurt either so if they're useful for platforms by all means carry on. Just know that we developers are going to complain pretty much whenever there's an issue regardless of whether it's within the SLA or not. We're fun like that. :)

In terms of the actual numbers though:

  • 5% expected failure rate feels very high. I'd expect that to translate to multiple failures each day (spread across different developers so not as bad as it sounds).
  • An hour is an extraordinarily long time for CI to complete - especially for only 80% of PRs. Pre-merge checks should be under 10 minutes as a base level standard (which we've generally failed to meet). Post-merge can be longer and for some types of tests reducing runtime either isn't an objective or isn't feasible (e.g. sync tests or long running performance tests).
  • The idea of disabling checks that are breaking seems wrong. If they're new checks then yes, we should roll them back, but otherwise it's generally a very bad idea to disable checks: it can let multiple issues slip through, which makes it very hard to get back to a passing state.
  • PR merge time is mostly dependent on when people provide reviews and how much rework/discussion is required. I'm not sure it's a meaningful CI metric.

ajsutton avatar Jun 24 '25 04:06 ajsutton

Seconding @ajsutton's comment. Right now the issue afflicting CI is flaky tests, and we can end the game of whack-a-mole if we either (1) develop a strong process for rapidly fixing flaky tests in the first place (represented by this proposal), or (2) get a little smarter about how we allow new tests into the repo.

Option (2) seems smarter to me. What if we do a diff of which test cases have been introduced to the repo in a given PR, and for any new test cases we run many instances of those tests in parallel to proactively detect flaky tests and reject their admittance to the repo?
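To make that concrete, here's a rough sketch of option (2): diff the branch against its base to find newly added Go test functions, then run only those tests many times so flakes surface before the PR merges. The base branch name, repetition count, and package pattern are assumptions to adjust for the actual repo layout; the runs are serial here via `-count`, whereas CI could fan them out in parallel.

```python
import re
import subprocess
import sys

BASE = "origin/develop"  # assumed base branch
RUNS = 50                # assumed repetition count

# Find test functions added in this branch relative to the base branch.
diff = subprocess.run(
    ["git", "diff", "--unified=0", f"{BASE}...HEAD", "--", "*_test.go"],
    capture_output=True, text=True, check=True,
).stdout

# Added lines in the diff start with a single '+'; pick out new Go test functions.
new_tests = sorted(set(re.findall(r"^\+func (Test\w+)\(", diff, flags=re.MULTILINE)))
if not new_tests:
    print("no new tests introduced")
    sys.exit(0)

# Run only the new tests, many times each, to surface flakes pre-merge.
pattern = "^(" + "|".join(new_tests) + ")$"
print(f"stress-running {len(new_tests)} new test(s) {RUNS}x each")
result = subprocess.run(["go", "test", "-run", pattern, "-count", str(RUNS), "./..."])
sys.exit(result.returncode)
```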

teddyknox avatar Jun 24 '25 14:06 teddyknox

I love that you're looking at ways to keep the infrastructure great with metrics. I like the flaky test identification, and clear failure attribution. And I agree with @ajsutton's comment about PR merge time, it isn't really an infra SLA, it's dependent on the PR authors and reviewers. Also, flaky test remediation isn't necessarily a platforms issue.

And I agree that it's dangerous to disable tests. Shouldn't it actually be reverting the PR that broke the tests?

pauldowman avatar Jun 24 '25 15:06 pauldowman

Thank you very much for the feedback all!

I like the idea of "getting smarter about how we allow new tests into the repo". We've previously discussed potentially introducing a staging gate, for example.

I'll have a think about it, produce a lightweight proposal with some ideas and send it around for some feedback! (probably after the Pause, realistically)

scharissis avatar Jun 25 '25 05:06 scharissis

Will not implement CI SLAs for now; closing.

scharissis avatar Jun 25 '25 05:06 scharissis