KEP-3169: Fine-grained SupplementalGroups control
- One-line PR description: the first draft of the KEP. This proposes a new API surface to control how supplemental groups are applied in the container
- Issue link: https://github.com/kubernetes/enhancements/issues/3619
- k/k issue: https://github.com/kubernetes/kubernetes/issues/112879
- Other comments: This needs updates to the CRI API and will impact CRI implementations. I'm not sure how to proceed with the implementation. I would really appreciate comments/guidance from active/experienced contributors.
Discussed in the Nov 29 SIG Node call; @mrunalp will take a pass.
/assign @mrunalp
Humm, I just tested containerd by hand, and I do not see this:
# nerdctl run --net=none -u 65534 ubuntu id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
# nerdctl run --net=none -u 65534 --group-add 1 --group-add 2 ubuntu id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup),1(daemon),2(bin)
# nerdctl run --net=none -u 65534:10 --group-add 1 --group-add 2 ubuntu id
uid=65534(nobody) gid=10(uucp) groups=10(uucp),1(daemon),2(bin)
It looks like it is correct - what runtime did you verify the issue on? Can you show the MINIMAL repro that demonstrates the problem?
It looks like it is correct - what runtime did you verify the issue on?
Thanks for the check on containerd. I think your check indicates that the current runAsGroup behavior in the Kubernetes API is not equivalent to -u xxx:<gid>.
Can you show the MINIMAL repro that demonstrates the problem?
I think the problem lies in the CRI implementations (if I understand correctly, nerdctl calls the containerd API directly and does not go through CRI, right?). The repro is here. Please take a look.
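For readers who do not want to follow the link, here is a rough sketch of the kind of repro being discussed; it is NOT the linked repro, and the image name, UID/GIDs, and values are purely illustrative. The point is that, unlike the nerdctl runs above, the kubelet/CRI path also merges in whatever groups the image's /etc/group defines for the runAsUser:

# Hypothetical repro sketch (image "example/groups-demo" is assumed to define
# membership in an extra group, e.g. GID 50000, for UID 1000 in its /etc/group).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: supgroups-repro
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    supplementalGroups: [60000]
  containers:
  - name: main
    image: example/groups-demo    # hypothetical image
    command: ["sh", "-c", "id && sleep 3600"]
EOF
kubectl wait --for=condition=Ready pod/supgroups-repro
kubectl logs supgroups-repro
# Per this issue, the reported groups would include the image-defined GID (50000)
# in addition to 1000 and 60000, whereas "docker run -u 1000:1000 --group-add 60000"
# would attach only 1000 and 60000.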
Do containerd and CRI-O owners think the current behavior is a feature (because they interpret k8s as demanding it) or a bug (oops)?
Not that I think history matters here... when it was added back in August 2017, my read is that it was adding extra supplemental groups https://github.com/kubernetes-sigs/cri-tools/pull/127. That it might be seen as ambiguous, and should instead be a filter, was missed. I remember discussions around needing to add extra supplemental groups. In hindsight, I can see how one would expect this to be an overwrite list rather than an extra list. Maybe ask Lantau or Pengfei if they remember this one.
Either way we are here and we need a change to achieve secure by default. Would prefer to do it without a breaking change if possible, but will understand if the edict is to hard code the change in behavior and cherry pick the change as far back as possible. If going that route an admission controller or runtime hook could be used to add container specified supplemental group(s) when supplemental group(s) are specified in the pod/container spec. We could also list removed supplemental groups as an annotation to make writing said hook a bit easier.
Not that I think history matters here... when it was added back in August 2017, my read is that it was adding extra supplemental groups kubernetes-sigs/cri-tools#127. That it might be seen as ambiguous, and should instead be a filter, was missed. I remember discussions around needing to add extra supplemental groups. In hindsight, I can see how one would expect this to be an overwrite list rather than an extra list. Maybe ask Lantau or Pengfei if they remember this one.
I think history does matter - if we did this for a reason, we should know that. It makes the "to break or not to break" decision more concrete. My position isn't that one model is notably better than the other (I can argue both), but more "principle of least surprise" and since Docker, OCI, and containerd all do it one way, it seems least surprising for k8s to do the same.
But it doesn't and we need to factor that into the calculus now.
@Random-Liu @feiskyer ?
Either way we are here and we need a change to achieve secure by default.
If "secure" means "follow OCI spec", then the only paths to secure-by-default are 1) breaking change; 2) codify it (this KEP) and spin core/v2 with a different default.
IMO we need to do that (core/v2) eventually - we have a number of these sorts of default-changes we would like to make. It's just a megaton of work to do, and it ripples into ancillary APIs like Deployment and RS, which means doing apps/v2 also. And that's still "opt-in", even if we talk it up and make it the CLI default - there's a WHOLE LOT of "core/v1" YAML out there that won't change.
Would prefer to do it without a breaking change if possible, but will understand if the edict is to hard code the change in behavior and cherry pick the change as far back as possible.
I had not considered a cherry pick, but...yeah, good question.
If going that route an admission controller or runtime hook could be used to add container specified supplemental group(s) when supplemental group(s) are specified in the pod/container spec. We could also list removed supplemental groups as an annotation to make writing said hook a bit easier.
I'm not quite following what you mean about admission?
IF we make the breaking change, it might be interesting to report back the GIDs not used from the container image.
If I am being internally consistent, we should NOT make a breaking change. The odds of someone using this specific set of behaviors as actual desired behavior feel diminutive, but I don't have data to back that up. I keep hoping someone will turn up and tell me "it's OK, just this once" but I don't think it is.
I think history does matter - if we did this for a reason, we should know that. It makes the "to break or not to break" decision more concrete
Either way we are here and we need a change to achieve secure by default.
If "secure" means "follow OCI spec", then the only paths to secure-by-default are 1) breaking change; 2) codify it (this KEP) and spin core/v2 with a different default.
How about introducing a feature gate (say OCICompatibleSupplementalGroups) that just changes the default behavior cluster-wide? Is that possible and allowed in terms of Kubernetes API policy? I'm not sure how we can do feature gating in the CRI layer, though. If possible, we could roll out the change while satisfying the "principle of least surprise" by using the standard feature gate graduation model.
How about introducing a feature gate ... that just changes the default behavior
Feature gates are bounded-lifetime. They have to go away. At that point it's just a breaking change.
Also, we generally don't want defaults decided by flags - it's hard to document and understand.
We could add an admission controller which sets the default value, but I think we're moving away from lots of builtin admission towards things like gatekeeper.
Feature gates are bounded-lifetime. They have to go away. At that point it's just a breaking change.
Yeah, I understand what you mean.
IMO we need to do that (core/v2) eventually - we have a number of these sorts of default-changes we would like to make. ... Also, we generally don't want defaults decided by flags - it's hard to document and understand. We could add an admission controller which sets the default value, but I think we're moving away from lots of builtin admission towards things like gatekeeper.
I understand that is why we have "a number of these sorts of default-changes we would like to make". And, I also understand why this is very hard.
If I am being internally consistent, we should NOT make a breaking change. The odds of someone using this specific set of behaviors as actual desired behavior feel diminutive, but I don't have data to back that up. I keep hoping someone will turn up and tell me "it's OK, just this once" but I don't think it is.
I completely agree with Tim's statement.
Then, would the below be reasonable?
- we add "changing the default of SupplementalGroups" to the backlog for core/v2
- this means we just give up "secure-by-default" in core/v1
- we introduce a new API described in the KEP in core/v1.
- it just focuses on providing an API for users to achieve the same behavior as docker/OCI, because we cannot change the default at the k8s API level (a rough sketch of what this could look like follows below).
WDYT?
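To make the last bullet concrete, here is a rough sketch of the kind of opt-in API this would mean, expressed as a pod manifest. The field name (supplementalGroupsPolicy) and its values are illustrative of the direction only, not a settled design:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: strict-supgroups
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    supplementalGroups: [60000]
    # Hypothetical opt-in knob: attach only runAsGroup plus the groups listed above,
    # and do NOT merge groups defined for the user in the image's /etc/group.
    # The default would keep today's merging behavior for compatibility.
    supplementalGroupsPolicy: Strict
  containers:
  - name: main
    image: ubuntu
    command: ["sh", "-c", "id && sleep 3600"]
EOF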
I think that's the path (which you were already on, sorry for the detour). I still want to see acknowledgement from containerd and cri-o (in the form of code-comments and/or tests) which make it clear WHY they do this, so it doesn't get fixed later.
...
agreed..
If going that route an admission controller or runtime hook could be used to add container specified supplemental group(s) when supplemental group(s) are specified in the pod/container spec. We could also list removed supplemental groups as an annotation to make writing said hook a bit easier.
I'm not quite following what you mean about admission?
I meant what you said later on... some mutating admission controller to set/override the default behavior... (either by flipping a switch or by adding missing, but allowed, supplemental group(s) into the pod spec)
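To sketch what that could look like with an external mutator (rather than a new built-in admission controller), here is a rough example assuming Gatekeeper's Assign mutation is installed in the cluster; the policy name, match scope, and GID are illustrative only and are not part of this KEP:

cat <<'EOF' | kubectl apply -f -
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: default-supplemental-groups   # hypothetical policy name
spec:
  applyTo:
  - groups: [""]
    kinds: ["Pod"]
    versions: ["v1"]
  match:
    scope: Namespaced
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  location: "spec.securityContext.supplementalGroups"
  parameters:
    # Only set a value when the pod does not already specify supplementalGroups,
    # so workloads that opt in explicitly keep their own list.
    pathTests:
    - subPath: "spec.securityContext.supplementalGroups"
      condition: MustNotExist
    assign:
      value: [65534]   # illustrative GID
EOF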
If I am being internally consistent, we should NOT make a breaking change. The odds of someone using this specific set of behaviors as actual desired behavior feel diminutive, but I don't have data to back that up. I keep hoping someone will turn up and tell me "it's OK, just this once" but I don't think it is.
well said..
If "secure" means "follow OCI spec", then the only paths to secure-by-default are 1) breaking change; 2) codify it (this KEP) and spin core/v2 with a different default.
Nod. Additionally, I first looked at this issue/enhancement request as a pod api thing for using shared supplemental group(s) across the set of containers in the pod. IMO we need to consider when/if it makes sense to add some portion of the pod spec to the OCI standard.
Note: for containerd we are also planning a v2 for our breaking changes (mostly config that will need to be migrated).
I want someone to do core/v2 - I just don't want it to be me :)
An alternative runtime-based approach (this isn't portable across k8s distros), but anyway:
- Add config to runtimes in 1.26.z (say 1.26.5) and 1.27.0+ to control the behavior (a rough sketch of such a config knob is below).
- While updating an existing 1.26 cluster to 1.26.5, add configuration that keeps the old behavior. Release-note this.
- Admins (working with users) can now scan whether their workloads are going to be affected or not, patch any affected pods with explicit supplementary groups, and then decide to switch to the new behavior in the runtime config. The cluster will then stay on the new behavior on further updates.
- New 1.27 clusters with 1.27 runtimes get the new secure behavior by default.
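To illustrate the first bullet, a hypothetical containerd CRI-plugin config knob is sketched below. No such option exists in containerd today; the key name is made up purely to show the shape of the idea, and CRI-O would need an equivalent:

# Hypothetical containerd config (the option name below does not exist today).
cat <<'EOF' >> /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri"]
  # true  = keep today's behavior (merge groups from the image's /etc/group)
  # false = OCI/docker-like behavior (only runAsGroup + SupplementalGroups from the CRI request)
  merge_image_defined_supplemental_groups = true
EOF
systemctl restart containerd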
That sounds very fragile to me
/lgtm
Thanks for your patience - I think the work was worthwhile to really understand it.
/approve
Agree that we may need to get some details figured out during the implementation.
PRR is good to go, but I do need to see SIG approval first.
I am approving this KEP purely based on:
- Agreed with the user story and the potential needs of the feature.
- Received /lgtms from API (@thockin), and node (@SergeyKanzhelev and @mrunalp)
- This is an alpha proposal, and without the KEP being merged, the implementation & design discussion might not be prioritized.
But I don't think we agree upon the design yet; please come to SIG Node to discuss the design. Thanks!
/approve
But I don't think we agree upon the design yet; please come to SIG Node to discuss the design. Thanks!
I'm confused - what part of this is left unresolved?
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dchen1107, everpeace, johnbelamaric, mrunalp, SergeyKanzhelev
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~keps/prod-readiness/OWNERS~~ [johnbelamaric]
- ~~keps/sig-node/OWNERS~~ [dchen1107,johnbelamaric]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
I will squash my commits. Squashing will also remove the do-not-merge/invalid-commit-message label.
But I don't think we agree upon the design yet; please come to SIG Node to discuss the design. Thanks!
I'm in Tokyo. It's very difficult for me to attend the SIG Node meeting because of the timezone (the meeting is 3am my time).
I'd also like to know which part is unclear for merging this PR. I'm glad to address/answer it.
Oh, my squash wiped out the lgtm label....
@thockin @mrunalp sorry, would you mind giving /lgtm again?
~~@dchen1107 I will keep the do-not-merge/hold label until I get your response.~~
edited: I noticed dchen1107 already gave /approve to the PR in https://github.com/kubernetes/enhancements/pull/3620#issuecomment-1423448612. So, I'm going to unhold the PR.
/unhold
@thockin @mrunalp I flipped the kep status to 'implementable' as part of tasks to satisfy KEP freeze requirements.
/lgtm
@everpeace
FYI: containerd fixed an issue related to supplemental GIDs. https://github.com/containerd/containerd/security/advisories/GHSA-hmfx-3pcx-653p https://github.com/containerd/containerd/commit/286a01f350a2298b4fdd7e2a0b31c04db3937ea8
I don't think this change affects this KEP, but please confirm.
Thanks for the info.
I don't think this change affects this KEP, but please confirm.
Confirmed. IIUC, the issue is caused by container runtimes not duplicating the primary group in the supplementary groups in some situations. This means that the primary group MUST always be duplicated into the supplementary groups, right?
Then, although the fix does not affect the KEP as described, I would say we would need to update the API description of SupplementalGroups so that it clarifies the behavior.
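For reference, the duplication being described is already visible in the containerd check earlier in this thread; no new experiment is needed:

# From the nerdctl output above:
# nerdctl run --net=none -u 65534:10 --group-add 1 --group-add 2 ubuntu id
# uid=65534(nobody) gid=10(uucp) groups=10(uucp),1(daemon),2(bin)
# Note that the primary group (10/uucp) also appears in "groups", i.e. it is
# duplicated into the supplementary groups.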
This means that the primary group MUST always be duplicated into the supplementary groups, right?
Right.
So this KEP got approved, but no PR merged for 1.27, right?
Will we revisit for 1.28?