KEP-3169: Fine-grained SupplementalGroups control
- One-line PR description: the first draft of the KEP. This proposes a new API surface to control how supplemental groups are applied in the container
- Issue link: https://github.com/kubernetes/enhancements/issues/3619
- k/k issue: https://github.com/kubernetes/kubernetes/issues/112879
- Other comments: This needs updates to the CRI API and will impact CRI implementations. I'm not sure how to proceed with the implementation. I would really appreciate comments/guidance from active/experienced contributors.
Discussed in the Nov 29 SIG Node call; @mrunalp will take a pass.
/assign @mrunalp
Humm, I just tested containerd by hand, and I do not see this:
# nerdctl run --net=none -u 65534 ubuntu id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
# nerdctl run --net=none -u 65534 --group-add 1 --group-add 2 ubuntu id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup),1(daemon),2(bin)
# nerdctl run --net=none -u 65534:10 --group-add 1 --group-add 2 ubuntu id
uid=65534(nobody) gid=10(uucp) groups=10(uucp),1(daemon),2(bin)
It looks like it is correct - what runtime did you verify the issue on? Can you show the MINIMAL repro that demonstrates the problem?
It looks like it is correct - what runtime did you verify the issue on?
Thanks for the check on containerd. I think your check indicates that the current runAsGroup behavior in the Kubernetes API is not equivalent to -u xxx:<gid>.
Can you show the MINIMAL repro that demonstrates the problem?
I think the problem lies in the CRI implementations (if I understand correctly, nerdctl calls the containerd API directly and does not go through CRI, right?). The repro is here. Please take a look.
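For readers who do not want to follow the link, here is a rough sketch of the kind of repro being discussed; it is NOT the linked repro, and the image name, UID/GIDs, and values are purely illustrative. The point is that, unlike the nerdctl runs above, the kubelet/CRI path also merges in whatever groups the image's /etc/group defines for the runAsUser:

# Hypothetical repro sketch (image "example/groups-demo" is assumed to define
# membership in an extra group, e.g. GID 50000, for UID 1000 in its /etc/group).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: supgroups-repro
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    supplementalGroups: [60000]
  containers:
  - name: main
    image: example/groups-demo    # hypothetical image
    command: ["sh", "-c", "id && sleep 3600"]
EOF
kubectl wait --for=condition=Ready pod/supgroups-repro
kubectl logs supgroups-repro
# Per this issue, the reported groups would include the image-defined GID (50000)
# in addition to 1000 and 60000, whereas "docker run -u 1000:1000 --group-add 60000"
# would attach only 1000 and 60000.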
Do containerd and CRI-O owners think the current behavior is a feature (because they interpret k8s as demanding it) or a bug (oops)?
Not that I think history matters here... when it was added back in August 2017, my read is that it was adding extra supplemental groups https://github.com/kubernetes-sigs/cri-tools/pull/127. That it might be seen as ambiguous, and should instead be a filter, was missed. I remember discussions around needing to add extra supplemental groups. In hindsight, I can see how one would expect this to be an overwrite list rather than an extra list. Maybe ask Lantau or Pengfei if they remember this one.
Either way we are here and we need a change to achieve secure by default. Would prefer to do it without a breaking change if possible, but will understand if the edict is to hard code the change in behavior and cherry pick the change as far back as possible. If going that route an admission controller or runtime hook could be used to add container specified supplemental group(s) when supplemental group(s) are specified in the pod/container spec. We could also list removed supplemental groups as an annotation to make writing said hook a bit easier.
Not that I think history matters here... when it was added back in August 2017, my read is that it was adding extra supplemental groups kubernetes-sigs/cri-tools#127. That it might be seen as ambiguous, and should instead be a filter, was missed. I remember discussions around needing to add extra supplemental groups. In hindsight, I can see how one would expect this to be an overwrite list rather than an extra list. Maybe ask Lantau or Pengfei if they remember this one.
I think history does matter - if we did this for a reason, we should know that. It makes the "to break or not to break" decision more concrete. My position isn't that one model is notably better than the other (I can argue both), but more "principle of least surprise" and since Docker, OCI, and containerd all do it one way, it seems least surprising for k8s to do the same.
But it doesn't and we need to factor that into the calculus now.
@Random-Liu @feiskyer ?
Either way we are here and we need a change to achieve secure by default.
If "secure" means "follow OCI spec", then the only paths to secure-by-default are 1) breaking change; 2) codify it (this KEP) and spin core/v2 with a different default.
IMO we need to do that (core/v2) eventually - we have a number of these sorts of default-changes we would like to make. It's just a megaton of work to do, and it ripples into ancillary APIs like Deployment and RS, which means doing apps/v2 also. And that's still "opt-in", even if we talk it up and make it the CLI default - there's a WHOLE LOT of "core/v1" YAML out there that won't change.
Would prefer to do it without a breaking change if possible, but will understand if the edict is to hard code the change in behavior and cherry pick the change as far back as possible.
I had not considered a cherry pick, but...yeah, good question.
If going that route an admission controller or runtime hook could be used to add container specified supplemental group(s) when supplemental group(s) are specified in the pod/container spec. We could also list removed supplemental groups as an annotation to make writing said hook a bit easier.
I'm not quite following what you mean about admission?
IF we make the breaking change, it might be interesting to report back the GIDs not used from the container image.
If I am being internally consistent, we should NOT make a breaking change. The odds of someone using this specific set of behaviors as actual desired behavior feel diminutive, but I don't have data to back that up. I keep hoping someone will turn up and tell me "it's OK, just this once" but I don't think it is.
I think history does matter - if we did this for a reason, we should know that. It makes the "to break or not to break" decision more concrete
Either way we are here and we need a change to achieve secure by default.
If "secure" means "follow OCI spec", then the only paths to secure-by-default are 1) breaking change; 2) codify it (this KEP) and spin core/v2 with a different default.
How about introducing a feature gate (say OCICompatibleSupplementalGroups) that just changes the default behavior cluster-wide? Is that possible and allowed in terms of Kubernetes API policy? I'm not sure how we can do feature gating in the CRI layer, though. If possible, we could roll out the change while satisfying the "principle of least surprise" by using the standard feature gate graduation model.
How about introducing a feature gate ... that just changes the default behavior
Feature gates are bounded-lifetime. They have to go away. At that point it's just a breaking change.
Also, we generally don't want defaults decided by flags - it's hard to document and understand.
We could add an admission controller which sets the default value, but I think we're moving away from lots of builtin admission towards things like gatekeeper.
Feature gates are bounded-lifetime. They have to go away. At that point it's just a breaking change.
Yeah, I understand what you mean.
IMO we need to do that (core/v2) eventually - we have a number of these sorts of default-changes we would like to make. ... Also, we generally don't want defaults decided by flags - it's hard to document and understand. We could add an admission controller which sets the default value, but I think we're moving away from lots of builtin admission towards things like gatekeeper.
I understand that is why we have "a number of these sorts of default-changes we would like to make". And, I also understand why this is very hard.
If I am being internally consistent, we should NOT make a breaking change. The odds of someone using this specific set of behaviors as actual desired behavior feel diminutive, but I don't have data to back that up. I keep hoping someone will turn up and tell me "it's OK, just this once" but I don't think it is.
I completely agree with Tim's statement.
Then, would the below be reasonable?
- we add "changing the default of SupplementalGroups" to the backlog for core/v2
- this means we just give up "secure-by-default" in core/v1
- we introduce a new API described in the KEP in core/v1.
- it just focuses on providing an API for users to achieve the same behavior as docker/OCI, because we cannot change the default at the k8s API level (a rough sketch of what this could look like follows below).
WDYT?
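To make the last bullet concrete, here is a rough sketch of the kind of opt-in API this would mean, expressed as a pod manifest. The field name (supplementalGroupsPolicy) and its values are illustrative of the direction only, not a settled design:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: strict-supgroups
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    supplementalGroups: [60000]
    # Hypothetical opt-in knob: attach only runAsGroup plus the groups listed above,
    # and do NOT merge groups defined for the user in the image's /etc/group.
    # The default would keep today's merging behavior for compatibility.
    supplementalGroupsPolicy: Strict
  containers:
  - name: main
    image: ubuntu
    command: ["sh", "-c", "id && sleep 3600"]
EOF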
I think that's the path (which you were already on, sorry for the detour). I still want to see acknowledgement from containerd and cri-o (in the form of code-comments and/or tests) which make it clear WHY they do this, so it doesn't get fixed later.
...
agreed..
If going that route an admission controller or runtime hook could be used to add container specified supplemental group(s) when supplemental group(s) are specified in the pod/container spec. We could also list removed supplemental groups as an annotation to make writing said hook a bit easier.
I'm not quite following what you mean about admission?
I meant what you said later on... some mutating admission controller to set/override the default behavior... (either by flipping a switch or by adding missing, but allowed, supplemental group(s) into the pod spec)
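To sketch what that could look like with an external mutator (rather than a new built-in admission controller), here is a rough example assuming Gatekeeper's Assign mutation is installed in the cluster; the policy name, match scope, and GID are illustrative only and are not part of this KEP:

cat <<'EOF' | kubectl apply -f -
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: default-supplemental-groups   # hypothetical policy name
spec:
  applyTo:
  - groups: [""]
    kinds: ["Pod"]
    versions: ["v1"]
  match:
    scope: Namespaced
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  location: "spec.securityContext.supplementalGroups"
  parameters:
    # Only set a value when the pod does not already specify supplementalGroups,
    # so workloads that opt in explicitly keep their own list.
    pathTests:
    - subPath: "spec.securityContext.supplementalGroups"
      condition: MustNotExist
    assign:
      value: [65534]   # illustrative GID
EOF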
If I am being internally consistent, we should NOT make a breaking change. The odds of someone using this specific set of behaviors as actual desired behavior feel diminutive, but I don't have data to back that up. I keep hoping someone will turn up and tell me "it's OK, just this once" but I don't think it is.
well said..
If "secure" means "follow OCI spec", then the only paths to secure-by-default are 1) breaking change; 2) codify it (this KEP) and spin core/v2 with a different default.
Nod. Additionally, I first looked at this issue/enhancement request as a pod api thing for using shared supplemental group(s) across the set of containers in the pod. IMO we need to consider when/if it makes sense to add some portion of the pod spec to the OCI standard.
Note: for containerd we are also planning a v2 for our breaking changes (mostly config that will need to be migrated).
I want someone to do core/v2 - I just don't want it to be me :)
An alternative runtime-based approach (this isn't portable across k8s distros), but anyway:
- Add config to runtimes in 1.26.z (say 1.26.5) and 1.27.0+ to control the behavior (a rough sketch of such a config knob is below).
- While updating an existing 1.26 cluster to 1.26.5, add configuration that keeps the old behavior. Release-note this.
- Admins (working with users) can now scan whether their workloads are going to be affected or not, patch any affected pods with explicit supplementary groups, and then decide to switch to the new behavior in the runtime config. The cluster will then stay on the new behavior on further updates.
- New 1.27 clusters with 1.27 runtimes get the new secure behavior by default.
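To illustrate the first bullet, a hypothetical containerd CRI-plugin config knob is sketched below. No such option exists in containerd today; the key name is made up purely to show the shape of the idea, and CRI-O would need an equivalent:

# Hypothetical containerd config (the option name below does not exist today).
cat <<'EOF' >> /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri"]
  # true  = keep today's behavior (merge groups from the image's /etc/group)
  # false = OCI/docker-like behavior (only runAsGroup + SupplementalGroups from the CRI request)
  merge_image_defined_supplemental_groups = true
EOF
systemctl restart containerd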
That sounds very fragile to me
/lgtm
Thanks for your patience - I think the work was worthwhile to really understand it.
/approve
Agree that we may need to get some details figured out during the implementation.
PRR is good to go, but I do need to see SIG approval first.
I am approving this KEP purely based on:
- Agreed with the user story and the potential needs of the feature.
- Received /lgtms from API (@thockin), and node (@SergeyKanzhelev and @mrunalp)
- This is an alpha proposal, and without the KEP being merged, the implementation & design discussion might not be prioritized.
But I don't think we agree upon the design yet; please come to SIG Node to discuss the design. Thanks!
/approve
But I don't think we agree upon the design yet; please come to SIG Node to discuss the design. Thanks!
I'm confused - what part of this is left unresolved?
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dchen1107, everpeace, johnbelamaric, mrunalp, SergeyKanzhelev
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~keps/prod-readiness/OWNERS~~ [johnbelamaric]
- ~~keps/sig-node/OWNERS~~ [dchen1107,johnbelamaric]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
I will squash my commits. Squashing will also remove the do-not-merge/invalid-commit-message label.
But I don't think we agree upon the design yet; please come to SIG Node to discuss the design. Thanks!
I'm in Tokyo. It's very difficult for me to attend the SIG Node meeting because of the timezone (the meeting is 3am my time).
I'd also like to know which part is unclear for merging this PR. I'm glad to address/answer it.
Oh, my squash wiped out the lgtm label....
@thockin @mrunalp sorry, would you mind giving /lgtm again?
~~@dchen1107 I will keep the do-not-merge/hold label until I get your response.~~
edited: I noticed dchen1107 already gave /approve to the PR in https://github.com/kubernetes/enhancements/pull/3620#issuecomment-1423448612. So, I'm going to unhold the PR.
/unhold
@thockin @mrunalp I flipped the kep status to 'implementable' as part of tasks to satisfy KEP freeze requirements.
/lgtm
@everpeace
FYI: containerd fixed an issue related to supplemental GIDs. https://github.com/containerd/containerd/security/advisories/GHSA-hmfx-3pcx-653p https://github.com/containerd/containerd/commit/286a01f350a2298b4fdd7e2a0b31c04db3937ea8
I don't think this change affects this KEP, but please confirm.
Thanks for the info.
I don't think this change affects this KEP, but please confirm.
Confirmed. IIUC, the issue is caused by container runtimes not duplicating the primary group in the supplementary groups in some situations. This means that the primary group MUST always be duplicated into the supplementary groups, right?
Then, although the fix does not affect the KEP as described, I would say we would need to update the API description of SupplementalGroups so that it clarifies the behavior.
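For reference, the duplication being described is already visible in the containerd check earlier in this thread; no new experiment is needed:

# From the nerdctl output above:
# nerdctl run --net=none -u 65534:10 --group-add 1 --group-add 2 ubuntu id
# uid=65534(nobody) gid=10(uucp) groups=10(uucp),1(daemon),2(bin)
# Note that the primary group (10/uucp) also appears in "groups", i.e. it is
# duplicated into the supplementary groups.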
This means that the primary group MUST always be duplicated into the supplementary groups, right?
Right.
So this KEP got approved, but no PR merged for 1.27, right?
Will we revisit for 1.28?