enhancements KEP-2413: Graduate SeccompDefault to stable

One-line PR description: We now graduate SeccompDefault to stable by basically making the feature enabled by default.
Issue link: https://github.com/kubernetes/enhancements/issues/2413
Other comments: PTAL @mrunalp @kubernetes/sig-node-pr-reviews

Jan 10 '23 10:01 saschagrunert

/lgtm /assign @derekwaynecarr @mrunalp @dchen1107

Jan 11 '23 07:01 bart0sh

/lgtm /approve

Jan 26 '23 16:01 mrunalp

@derekwaynecarr ptal

Jan 26 '23 16:01 mrunalp

From what I understand, no actual code changes are required as part of this move.

/approve /lgtm

Jan 31 '23 21:01 derekwaynecarr

@kubernetes/production-readiness PTAL

Feb 01 '23 07:02 saschagrunert

Github doesn't let me comment on arbitrary lines, so I'm leaving some general feedback here:

What is the behavior if you specify the profile RuntimeDefault with a privileged: true container? I think I'd expect the default to be unconfined with a privileged pod, but to enforce the RuntimeDefault profile if I explicitly specify it. It looks like Kubelet will still pass RuntimeDefault to the CRI in this case. What do containerd & crio do in this case?
It's unfortunate that this approach doesn't work well with Pod Security Admission. Without global cluster configuration and/or visibility, the restricted PSS profile will still need to require an explicit Seccomp profile. Any ideas for addressing this?

From a PRR perspective, this LGTM.

Feb 01 '23 22:02 tallclair

Hey @tallclair, thank you for raising those questions!

What is the behavior if you specify the profile RuntimeDefault with a privileged: true container? I think I'd expect the default to be unconfined with a privileged pod, but to enforce the RuntimeDefault profile if I explicitly specify it. It looks like Kubelet will still pass RuntimeDefault to the CRI in this case. What do containerd & crio do in this case?

CRI-O and containerd will still apply unconfined (aka nothing), because the seccomp paths are guarded by the privileged check:

https://github.com/cri-o/cri-o/blob/cda07c8ac0b2b53397a286e7b540394c9f1358db/server/container_create_linux.go#L622-L631

https://github.com/containerd/containerd/blob/6ed24c88ed0530aa8d4483b13a217172365f61cc/pkg/cri/server/container_create_linux.go#L464-L467

It's unfortunate that this approach doesn't work well with Pod Security Admission. Without global cluster configuration and/or visibility, the restricted PSS profile will still need to require an explicit Seccomp profile. Any ideas for addressing this?

From a PRR perspective, this LGTM.

I was thinking back and forth about this feature about making it API aware. The fact that we designed it as kubelet feature makes it easily configurable for cluster admins on a per-node basis, but also less visible to end users. If this feature is stable and we have a good user base adopting it (let's say 2-3 released), then we could think about introducing a similar feature in the API server.

Feb 02 '23 08:02 saschagrunert

Thanks. /lgtm

Feb 02 '23 19:02 tallclair

Just to confirm, with this graduating to stable, users will still need to opt-in by configuring their kubelets with --seccomp-default? Is it correct?

I ask since I am aware of some performance regressions when running with seccomp enabled with some workloads, specifically with spectre mitigations. There were some discussions around disabling those mitigations with seccomp on the container runtime side, but I don't think they were completed. Namely:

https://github.com/containers/common/pull/938
https://github.com/opencontainers/runc/pull/2433
https://github.com/opencontainers/runtime-spec/pull/1047

It also looks like Linux 5.16 has changed some of the defaults regarding seccomp with this patch (xref: https://lore.kernel.org/lkml/[email protected]/)

I've been recommending to everyone to use "spec_store_bypass_disable=prctl spectre_v2_user=prctl" for a while now. I already recommend to Yifei too a few months ago when he first found out of the huge seccomp regression when he upgraded his codebase to the upstream kernel with both STIBP/SSBD enabled in seccomp jails.

I haven't tested the perf changes with that patch, but it may be something we should document, as there is no stable kernel release with that patch yet, so most users will not have it and possibly have performance degradations as a result.

Feb 02 '23 20:02 bobbypage

/approve /hold

holding for and answer to @bobbypage's question. I don't think an already fixed performance degradation changes the PRR answers, but it's worth addressing.

Feb 02 '23 21:02 deads2k

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, derekwaynecarr, mrunalp, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [deads2k]
~~keps/sig-node/OWNERS~~ [derekwaynecarr]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Feb 02 '23 21:02 k8s-ci-robot

Just to confirm, with this graduating to stable, users will still need to opt-in by configuring their kubelets with --seccomp-default? Is it correct?

No, the main goal of the graduation is to switch --seccomp-default=true per default: https://github.com/kubernetes/enhancements/pull/3718/files#diff-0509bda37cb3b099fee004ae9dfcaf2029ce1370d81f3f3931a67dff8ec26997R189

Users should now configure their kubelets to opt-out either by using the feature gate or the configuration option if they don't want to use it.

I haven't tested the perf changes with that patch, but it may be something we should document, as there is no stable kernel release with that patch yet, so most users will not have it and possibly have performance degradations as a result.

I agree we should document the performance impact of seccomp on k/website, I can take care of that during the graduation of the feature.

Feb 03 '23 08:02 saschagrunert

I am :+1: on enabling by default. This is the performance vs. security tradeoff and we can choose security while giving admins opt out if they prefer performance. (A bunch of CVEs would have been avoided if seccomp were enabled by default.) cc: @derekwaynecarr

Feb 03 '23 17:02 mrunalp

I am ok with the default on from the security perspective. But as pointed by @bobbypage at https://github.com/kubernetes/enhancements/pull/3718#issuecomment-1414343045, we should

Documented the performance impact and how to disable the feature
Mention the potential performance impact on the release note

Feb 03 '23 23:02 dchen1107

Some work that went into runc for perf - https://github.com/opencontainers/runc/pull/3588

Feb 03 '23 23:02 mrunalp

From the kernel link:

- runc/crun already set SECCOMP_FILTER_FLAG_SPEC_ALLOW by default, k8s
  and podman have a default json seccomp allowlist that cannot be
  slowed down, so for the #1 seccomp user this change is already a
  noop.

Feb 03 '23 23:02 mrunalp

Thanks all for discussion. I performed a bit of adhoc testing:

$ gcloud compute instances create cos-dev --machine-type=n2-standard-4     --image-family="cos-101-lts"     --image-project="cos-cloud"     --zone=us-central1-c

cos-dev ~ # uname -a
Linux cos-dev2 5.15.65+ #1 SMP Sat Jan 21 10:12:05 UTC 2023 x86_64 Intel(R) Xeon(R) CPU @ 2.80GHz GenuineIntel GNU/Linux

cos-dev ~ # cat /etc/os-release
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
GOOGLE_METRICS_PRODUCT_ID=26
KERNEL_COMMIT_ID=44456f0e9d2cd7a9616fb0d05bc4020237839a5a
GOOGLE_CRASH_ID=Lakitu
VERSION=101
VERSION_ID=101
BUILD_ID=17162.40.56

This is latest COS OS, with latest 5.15 stable kernel. I ran pybench benchmark as recommend in this article. I'm not sure how representative this benchmark is, but it seems like a common python bechmark:

## With seccomp
cos-dev1 ~ # docker run --rm     -v `pwd`:/root/results     ljishen/pybench     bench /root/results/output
Totals:                           2773ms   2798ms
 
## Disable Seccomp
cos-dev1 ~ # docker run --rm  --security-opt seccomp=unconfined   -v `pwd`:/root/results     ljishen/pybench     bench /root/results/output
Totals:                           1964ms   1981ms

So with seccomp enabled out of the box, the perf impact is pretty significant, ~ 1.5x worse.

Applying the kernel command line changes as was done in the above kernel patch of setting: (spec_store_bypass_disable=prctl spectre_v2_user=prctl):

## With seccomp, but spectre mitigations disabled
cos-dev1 ~ # docker run --rm     -v `pwd`:/root/results     ljishen/pybench     bench /root/results/output
Totals:                           1955ms   1975ms

The results are much better and equivalent to running without seccomp before.

The kernel patch will change defaults to make it the default in newer kernels but it was not back ported to the latest 5.15 kernel yet. The runc patches will also accomplish the same effect of disabling those mitigations, but will require folks to use the newer runc.

I'm also supportive to enable this for security improvements, and it's great to see we have mitigation in place for perf impact, but we need to communicate it clearly to users that they should either:

(1) update to newer kernel with that patch (2) change their kernel command line (3) update to latest runc

... to ensure they will not be impacted by any possible perf regressions.

Feb 04 '23 00:02 bobbypage

Thank you for all the inputs, I added a new section to the "Beta to GA Graduation" to contain all the documentation bits.

Feb 06 '23 08:02 saschagrunert

@mrunalp @bobbypage @derekwaynecarr @dchen1107 can we merge this one now? I don't see any outstanding points left.

Feb 07 '23 13:02 saschagrunert

/lgtm

Feb 07 '23 19:02 mrunalp

@deads2k Are you okay removing the hold?

Feb 07 '23 19:02 mrunalp

@deads2k Are you okay removing the hold?

Yes, thank you for the updates.

/hold cancel

Feb 08 '23 21:02 deads2k

enhancements enhancements copied to clipboard

KEP-2413: Graduate SeccompDefault to stable

enhancements
enhancements copied to clipboard