k8s.io
24x7 k8s-infra on-call
Kubernetes is a global project, used in enough mission-critical environments that support for non-prow-related issues should have a 24x7 on-call rotation (activatable by any Kubernetes member) with a primary and a standby.
What are the limitations that would prevent a call for volunteers to go out to augment the existing team that is doing their very best but are ultimately PST bound and providing support on a best-effort basis?
I know there was a call for general volunteers and I am not sure how successful that was. What I am referring to is a specific call for people to be added to an official on-call rotation (after meeting pre-determined criteria, vetting, and continued participation in the wg, etc.)
@kubernetes/steering-committee
@moshloop The tl;dr is that if you want this, show up and do the work to help us get to where we need to be. I want this same thing, but can come off as pessimistic when I laundry-list like this, so maybe @dims or @thockin have a more optimistic take.
My experience is that "Congratulations, you're on-call! Figure out what that means and how to do it. Oh and you can break things really badly if you're not careful. Good luck!" does not typically go over well.
What are the limitations that would prevent a call for volunteers to go out to augment the existing team that is doing their very best but are ultimately PST bound and providing support on a best-effort basis?
Trust:
- that people can't break critical aspects of the project
- that people can't misappropriate funding (aka "won't mine bitcoin")
- that since "can't" requires engineering, people with elevated privileges "won't"
- if critical aspects of the project are broken, people are committed to resolving it
- that any potential PII is handled according to whatever guidelines CNCF dictates
Bandwidth to achieve that:
- reduce the scope/blast-radius of individuals and changes
- tightening down overprivileged access takes time
- gaining trust in our automation takes time (it's bash, there is barely any testing, a single typo could result in lots of unplanned work)
- reviewing changes to our automation to improve the above takes time
- ensuring disaster recovery plans and capabilities are in place takes time
- setting up the "infra" to allow onboarding volunteers takes time
- documenting rules of engagement and policies takes time
- setting up the mechanics for on-call takes time
- herding/onboarding/organizing an influx of new members into roles takes time
after meeting pre-determined criteria, vetting, and continued participation in the wg, etc.)
So I think the above describes work to be done to make the first two points possible/lower-barrier. It is work that people other than WG leads can do; we need to review it, but consistently showing up and doing the work will establish trust, and reduce our review burden.
I know there was a call for general volunteers and I am not sure how successful that was
To be honest, my impression is that a lot of people showed up at first, and were very interested. Then it quickly became clear that the "boring" sort of work I listed above was what we actually needed to unblock our overburdened WG leads.
To turn the question back around, what do you think next steps would be to get to 24x7 on-call? What would the scope of that on-call be?
For what it's worth, I'd love to see this be more formalized. But I agree with Aaron - it's not as simple as getting people to sign up. This is a somewhat under-serviced WG right now. We're incrementally, slowly, piece by piece getting better, but part of that is also revisiting prior decisions with new context.
When it was just me editing the bash, it was fine. Now that we have a) a lot more bash and b) more people modifying it, the shortcomings of that decision become clearer.
In addition to that, we don't have great "playbooks" in large part "because nobody has written them". I can't, in good conscience, ask people to take responsibility for that. It has to start with people helping bootstrap the whole endeavour.
Thanks for the feedback - I will draft a working document for discussion at the next meeting.
Work in progress is here: https://docs.google.com/document/d/1ih_h-0qnX8B0VjtoE1a6YjF8VTimoZ1jm_U78wnyu0g/edit
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle frozen
/priority important-longterm
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
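For readers unfamiliar with the triage bot, the rules it lists can be sketched as a small state function. This is only an illustration of the thresholds quoted in the bot comment (90d, then 30d, then 30d), not the bot's actual implementation. The function name `next_lifecycle`, the string states, and the treatment of `frozen` as exempt from triage are assumptions made for the example, as is the idea that applying a label resets the inactivity clock.

```python
from datetime import date, timedelta
from typing import Optional

# Thresholds taken from the bot comment above.
STALE_AFTER = timedelta(days=90)
ROTTEN_AFTER = timedelta(days=30)
CLOSE_AFTER = timedelta(days=30)


def next_lifecycle(current: Optional[str], last_activity: date, today: date) -> Optional[str]:
    """Return the lifecycle state an issue should be in (hypothetical helper).

    current: None (fresh), "stale", "rotten", or "frozen".
    last_activity: date of the last activity; assumed to include the moment
    the current label was applied, so the clock restarts at each transition.
    """
    idle = today - last_activity
    if current == "frozen":
        # Assumption: frozen issues are never touched by the triage rules.
        return "frozen"
    if current is None:
        return "stale" if idle >= STALE_AFTER else None
    if current == "stale":
        return "rotten" if idle >= ROTTEN_AFTER else "stale"
    if current == "rotten":
        return "closed" if idle >= CLOSE_AFTER else "rotten"
    return current
```

Under these assumptions an untouched issue is closed roughly 150 days after its last activity, which matches the sequence of bot comments in this thread.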
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten /lifecycle frozen
I don't see any progress or any improvement regarding this.
Closing for now. /close
@ameukam: Closing this issue.