k8s.io 24x7 k8s-infra on-call

Kubernetes is a global project, and it is used in enough mission-critical environments that support for non-prow-related issues should have a 24x7 on-call rotation (which any Kubernetes member can activate) with a primary and a standby.

What are the limitations that would prevent a call for volunteers to go out to augment the existing team that is doing their very best but are ultimately PST bound and providing support on a best-effort basis?

I know there was a call for general volunteers and I am not sure how successful that was. What I am referring to is a specific call for people to be added to an official on-call rotation (after meeting pre-determined criteria, vetting, and continued participation in the wg, etc.).

@kubernetes/steering-committee

Mar 11 '21 19:03 moshloop

@moshloop The tl;dr is that if you want this, show up and do the work to help us get to where we need to be. I want this same thing, but can come off as pessimistic when I laundry-list like this, so maybe @dims or @thockin have a more optimistic take.

My experience is that "Congratulations, you're on-call! Figure out what that means and how to do it. Oh and you can break things really badly if you're not careful. Good luck!" does not typically go over well.

> What are the limitations that would prevent a call for volunteers to go out to augment the existing team that is doing their very best but are ultimately PST bound and providing support on a best-effort basis?

Trust:

- that people can't break critical aspects of the project
- that people can't misappropriate funding (aka "won't mine bitcoin")
- that since "can't" requires engineering, people with elevated privileges "won't"
- if critical aspects of the project are broken, people are committed to resolving it
- that any potential PII is handled according to whatever guidelines CNCF dictates

Bandwidth to achieve that:

- reduce the scope/blast-radius of individuals and changes
  - tightening down overprivileged access takes time
  - gaining trust in our automation takes time (it's bash, there is barely any testing, and a single typo could result in lots of unplanned work)
  - reviewing changes to our automation to improve the above takes time
  - ensuring disaster recovery plans and capabilities are in place takes time
- set up the "infra" to allow onboarding volunteers
  - documenting rules of engagement and policies takes time
  - setting up the mechanics for on-call takes time
  - herding/onboarding/organizing an influx of new members into roles takes time

> after meeting pre-determined criteria, vetting, and continued participation in the wg, etc.)

So I think the above describes work to be done to make the first two points possible/lower-barrier. It is work that people other than WG leads can do; we need to review it, but consistently showing up and doing the work will establish trust, and reduce our review burden.

> I know there was a call for general volunteers and I am not sure how successful that was

To be honest, my impression is that a lot of people showed up at first and were very interested. Then it quickly became clear that the "boring" sort of work I listed above was what we actually needed in order to unblock overburdened WG leads.

To turn the question back around, what do you think next steps would be to get to 24x7 on-call? What would the scope of that on-call be?

Mar 12 '21 17:03 spiffxp

For what it's worth, I'd love to see this be more formalized. But I agree with Aaron - it's not as simple as getting people to sign up. This is a somewhat under-serviced WG right now. We're incrementally, slowly, piece by piece getting better, but part of that is also revisiting prior decisions with new context.

When it was just me editing the bash, it was fine. Now that we have a) a lot more bash and b) more people modifying it, the shortcomings of that decision become clearer.

In addition to that, we don't have great "playbooks" in large part "because nobody has written them". I can't, in good conscience, ask people to take responsibility for that. It has to start with people helping bootstrap the whole endeavour.

Mar 12 '21 17:03 thockin

Thanks for the feedback - I will draft a working document for discussion at the next meeting.

Mar 15 '21 19:03 moshloop

Work in progress is here: https://docs.google.com/document/d/1ih_h-0qnX8B0VjtoE1a6YjF8VTimoZ1jm_U78wnyu0g/edit

Mar 18 '21 06:03 moshloop

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jun 16 '21 07:06 fejta-bot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Jul 17 '21 02:07 fejta-bot

/remove-lifecycle rotten lifecycle frozen

Jul 17 '21 08:07 ameukam

/priority important-longterm

Sep 02 '21 19:09 spiffxp

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
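
The three rules above amount to a small state machine. Here is a minimal sketch of it in Python, assuming a single current lifecycle label per issue and treating "inactivity" as days since the later of the last activity or the last label change. The function name `next_action` and the constants are hypothetical and are not taken from the bot's actual implementation. The exemption for `lifecycle/frozen` is also an assumption based on how that label is used in this thread.

```python
from datetime import date, timedelta
from typing import Optional

# Thresholds taken from the rules listed above.
STALE_AFTER = timedelta(days=90)
ROTTEN_AFTER = timedelta(days=30)
CLOSE_AFTER = timedelta(days=30)


def next_action(label: Optional[str], last_activity: date, today: date) -> Optional[str]:
    """Return the label to apply, "close", or None if nothing should happen.

    `label` is the issue's current lifecycle label (None if it has none).
    `last_activity` is the later of the last human activity and the time the
    current label was applied (a simplification for this sketch).
    """
    idle = today - last_activity
    if label == "lifecycle/frozen":
        # Assumption: frozen issues are exempt from triage.
        return None
    if label is None and idle >= STALE_AFTER:
        return "lifecycle/stale"
    if label == "lifecycle/stale" and idle >= ROTTEN_AFTER:
        return "lifecycle/rotten"
    if label == "lifecycle/rotten" and idle >= CLOSE_AFTER:
        return "close"
    return None
```

For example, `next_action(None, date(2021, 1, 1), date(2021, 1, 1) + timedelta(days=90))` returns `"lifecycle/stale"`, while the same call with `"lifecycle/frozen"` as the label returns `None`.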

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Dec 28 '21 22:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Jan 27 '22 22:01 k8s-triage-robot

/remove-lifecycle rotten
/lifecycle frozen

Jan 28 '22 04:01 ameukam

I don't see any progress or any improvement regarding this.

Closing for now. /close

Jul 27 '23 20:07 ameukam

@ameukam: Closing this issue.

In response to this:

I don't see any progress or any improvement regarding this.

Closing for now. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jul 27 '23 20:07 k8s-ci-robot
