[Initiative]: Reference framework for the levels of Service Reliability Automation

Open svrnm opened this issue 3 weeks ago • 6 comments

Name

Levels Of Service Reliability Automation

Short description

Create a reference document that provides a framework around "levels of service reliability automation". It allows end users to identify where they stand today and how they can improve. Likewise, it allows open source projects and commercial products to show where they can help users move from one level to the next.

Responsible group

TAG Operational Resilience

Does the initiative belong to a subproject?

Yes

Subproject name

No response

Primary contact

Severin Neumann, Causely, (@svrnm)

Additional contacts

  • Steffen Geissinger, Causely, (@ib-steffen)
  • Will Hegedus, Linode (@wbh1)
  • Diana Todea, VictoriaMetrics (@didiViking)
  • Vitor Vasconcellos, MercadoLibre (@vitorvasc)

Initiative description

Motivation

Reliability today is often framed narrowly as "observability", "incident response" and "troubleshooting". We see this especially when AI SREs are pitched as a way to support humans in the loop, or to take them out of the loop, but essentially offer LLM-powered on-call automation focused on those three domains. These tools mostly support reactive troubleshooting.

However, true service reliability (or operational resilience) spans the entire lifecycle, from building reliable software, to resource management, to release management, to maintaining and improving ongoing operations, observability, incident response, troubleshooting and more.

With this proposal, we want to show a bigger picture: how reliability engineering tasks (done by SREs, developers, ...) can climb a ladder from manual work to full autonomy.

Goal

The goal of this initiative is to create a reference document that provides a framework around "levels of service reliability automation". It allows end users to identify where they stand today and how they can improve. Likewise, it allows open source projects and commercial products to show where they can help users move from one level to the next.

Examples

  • In software development, teams might manually analyze their code for reliability issues today, or they might have automation that identifies potential issues and the required changes.
  • For ongoing operations, the team might do manual resource management, or they might have automation that scales resources up and down autonomously within predefined guardrails.
  • For observability, teams might add instrumentation to their code manually, or they might leverage different kinds of automation that provide either out-of-the-box instrumentation or LLM-based guidance on where and how improvements can be made.
  • ...
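To make the resource management example concrete, here is a minimal Python sketch of what "scaling within predefined guardrails" could look like. It is only an illustration: the `Guardrails` type, the `next_replica_count` function, and the proportional scaling rule are assumptions made for this example, not part of the proposal or of any existing tool.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    """Hypothetical limits a team defines up front."""
    min_replicas: int
    max_replicas: int
    max_step: int  # largest change allowed in a single scaling decision


def next_replica_count(current: int, utilization: float,
                       target_utilization: float, rails: Guardrails) -> int:
    """Propose a replica count, then clamp it to the guardrails."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    # Assumed proportional rule: scale by observed load over target load.
    desired = math.ceil(current * utilization / target_utilization)
    # Guardrail 1: never change by more than max_step replicas at once.
    step = max(-rails.max_step, min(rails.max_step, desired - current))
    # Guardrail 2: stay within the absolute bounds.
    return max(rails.min_replicas, min(rails.max_replicas, current + step))


rails = Guardrails(min_replicas=2, max_replicas=10, max_step=3)
print(next_replica_count(4, 0.9, 0.5, rails))  # scale up, limited by max_step
print(next_replica_count(4, 0.1, 0.5, rails))  # scale down, limited by min_replicas
```

The automation decides on its own, but humans still set the boundaries it has to respect, which is what separates this from both fully manual scaling and unconstrained autonomy.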

Inspiration

We borrow the framing from the SAE “Levels of Driving Automation”. Just as cars progress from manual to fully autonomous driving, reliability systems can progress from manual to autonomous reliability. This analogy creates a shared language to describe where the industry is today and where it’s heading. For example, there might be levels 0 ("manual"), 1 ("rule book automation"), 2 ("reactive assistants"), 3 ("proactive guidance"), 4 ("human in the loop autonomy") and 5 ("full autonomy"). These levels might then be defined for the different domains called out above.

Note 1: This is not necessarily how this whitepaper needs to be structured, or what those levels need to look like. Also, for some domains it will not be necessary to go through all the levels, or not all levels will make sense. The goal of the initiative is to chart that "map" in collaboration.

Note 2: While AI (and especially LLMs) plays a big role in reaching the higher levels in this framework, it is not a necessity. A valuable outcome of this framework might be that it helps to identify where AI is the right tool and where it might be a poor fit or even harmful.
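As a rough illustration of how the example levels from the Inspiration section could be used for a self-assessment, here is a small Python sketch. The level names come from the example list in this issue. The `next_step` and `least_automated` helpers and the sample assessment values are made up for illustration, and the final framework may define levels and domains differently.

```python
from enum import IntEnum
from typing import Dict, List, Optional


class AutomationLevel(IntEnum):
    """Example levels from this issue; the final framework may differ."""
    MANUAL = 0
    RULE_BOOK_AUTOMATION = 1
    REACTIVE_ASSISTANTS = 2
    PROACTIVE_GUIDANCE = 3
    HUMAN_IN_THE_LOOP_AUTONOMY = 4
    FULL_AUTONOMY = 5


def next_step(current: AutomationLevel) -> Optional[AutomationLevel]:
    """Return the next level on the ladder, or None at the top."""
    if current is AutomationLevel.FULL_AUTONOMY:
        return None
    return AutomationLevel(current + 1)


def least_automated(assessment: Dict[str, AutomationLevel]) -> List[str]:
    """Return the domains that sit at the lowest level in an assessment."""
    lowest = min(assessment.values())
    return [domain for domain, level in assessment.items() if level == lowest]


# Hypothetical self-assessment of one team, per domain.
assessment = {
    "resource management": AutomationLevel.RULE_BOOK_AUTOMATION,
    "observability": AutomationLevel.MANUAL,
    "incident response": AutomationLevel.REACTIVE_ASSISTANTS,
}

for domain in least_automated(assessment):
    print(domain, "->", next_step(assessment[domain]).name)
```

Keeping a level per domain rather than one global score matches the idea that a team can be far along in one area and still manual in another.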

Scope

The scope of this project is to provide that framework, a common language, and examples per category. It may already position certain projects and/or products to verify its applicability, but it does not aim to provide a complete "landscape"; that may be a follow-up activity.

The scope also excludes building or designing any projects to "fill out" the levels in some of the domains, although the framework might inspire existing projects to take on that task or new projects to emerge.

Deliverable(s) or exit criteria

A shared reference document that contains the framework as outlined above

Tracking document for meeting and progress

tbd

svrnm avatar Dec 03 '25 13:12 svrnm

Targeted audience: SREs, service owners, product companies in the reliability / OTel space.

Discussed in the meeting of 03.12.2025: Severin agreed to reach out to some more people and then put this up for a vote in January 2026.

mfahlandt avatar Dec 03 '25 16:12 mfahlandt

I think this is a great idea 🙂 Commenting here to express interest in participating and assisting if this is approved.

wbh1 avatar Dec 04 '25 12:12 wbh1

I'm interested in contributing, and I support this initiative.

didiViking avatar Dec 04 '25 13:12 didiViking

It sounds really great! I'm interested in participating. 😄

vitorvasc avatar Dec 04 '25 21:12 vitorvasc

/tag tag/operational-resilience

kevin-wangzefeng avatar Dec 08 '25 14:12 kevin-wangzefeng

It sounds really interesting, I would like to jump in and participate in this initiative! Could you please keep me in the loop? 🙏🏻

graz-dev avatar Dec 15 '25 13:12 graz-dev

@graz-dev of course, added you alongside everyone else to the issue description!

We are still looking for more people who are interested in contributing, and then hopefully, after a successful vote on the project in early 2026, we can get started :-)

svrnm avatar Dec 16 '25 07:12 svrnm