enhancements
enhancements copied to clipboard
KEP-2413: Graduate SeccompDefault to stable
- One-line PR description: We now graduate
SeccompDefault
to stable by basically making the feature enabled by default. - Issue link: https://github.com/kubernetes/enhancements/issues/2413
- Other comments: PTAL @mrunalp @kubernetes/sig-node-pr-reviews
/lgtm /assign @derekwaynecarr @mrunalp @dchen1107
/lgtm /approve
@derekwaynecarr ptal
From what I understand, no actual code changes are required as part of this move.
/approve /lgtm
@kubernetes/production-readiness PTAL
Github doesn't let me comment on arbitrary lines, so I'm leaving some general feedback here:
- What is the behavior if you specify the profile
RuntimeDefault
with aprivileged: true
container? I think I'd expect the default to beunconfined
with a privileged pod, but to enforce theRuntimeDefault
profile if I explicitly specify it. It looks like Kubelet will still pass RuntimeDefault to the CRI in this case. What do containerd & crio do in this case? - It's unfortunate that this approach doesn't work well with Pod Security Admission. Without global cluster configuration and/or visibility, the restricted PSS profile will still need to require an explicit Seccomp profile. Any ideas for addressing this?
From a PRR perspective, this LGTM.
Hey @tallclair, thank you for raising those questions!
- What is the behavior if you specify the profile
RuntimeDefault
with aprivileged: true
container? I think I'd expect the default to beunconfined
with a privileged pod, but to enforce theRuntimeDefault
profile if I explicitly specify it. It looks like Kubelet will still pass RuntimeDefault to the CRI in this case. What do containerd & crio do in this case?
CRI-O and containerd will still apply unconfined (aka nothing), because the seccomp paths are guarded by the privileged check:
https://github.com/cri-o/cri-o/blob/cda07c8ac0b2b53397a286e7b540394c9f1358db/server/container_create_linux.go#L622-L631
https://github.com/containerd/containerd/blob/6ed24c88ed0530aa8d4483b13a217172365f61cc/pkg/cri/server/container_create_linux.go#L464-L467
- It's unfortunate that this approach doesn't work well with Pod Security Admission. Without global cluster configuration and/or visibility, the restricted PSS profile will still need to require an explicit Seccomp profile. Any ideas for addressing this?
From a PRR perspective, this LGTM.
I was thinking back and forth about this feature about making it API aware. The fact that we designed it as kubelet feature makes it easily configurable for cluster admins on a per-node basis, but also less visible to end users. If this feature is stable and we have a good user base adopting it (let's say 2-3 released), then we could think about introducing a similar feature in the API server.
Thanks. /lgtm
Just to confirm, with this graduating to stable, users will still need to opt-in by configuring their kubelets with --seccomp-default
? Is it correct?
I ask since I am aware of some performance regressions when running with seccomp enabled with some workloads, specifically with spectre mitigations. There were some discussions around disabling those mitigations with seccomp on the container runtime side, but I don't think they were completed. Namely:
- https://github.com/containers/common/pull/938
- https://github.com/opencontainers/runc/pull/2433
- https://github.com/opencontainers/runtime-spec/pull/1047
It also looks like Linux 5.16 has changed some of the defaults regarding seccomp with this patch (xref: https://lore.kernel.org/lkml/[email protected]/)
I've been recommending to everyone to use "spec_store_bypass_disable=prctl spectre_v2_user=prctl" for a while now. I already recommend to Yifei too a few months ago when he first found out of the huge seccomp regression when he upgraded his codebase to the upstream kernel with both STIBP/SSBD enabled in seccomp jails.
I haven't tested the perf changes with that patch, but it may be something we should document, as there is no stable kernel release with that patch yet, so most users will not have it and possibly have performance degradations as a result.
/approve /hold
holding for and answer to @bobbypage's question. I don't think an already fixed performance degradation changes the PRR answers, but it's worth addressing.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deads2k, derekwaynecarr, mrunalp, saschagrunert
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~keps/prod-readiness/OWNERS~~ [deads2k]
- ~~keps/sig-node/OWNERS~~ [derekwaynecarr]
Approvers can indicate their approval by writing /approve
in a comment
Approvers can cancel approval by writing /approve cancel
in a comment
Just to confirm, with this graduating to stable, users will still need to opt-in by configuring their kubelets with
--seccomp-default
? Is it correct?
No, the main goal of the graduation is to switch --seccomp-default=true
per default:
https://github.com/kubernetes/enhancements/pull/3718/files#diff-0509bda37cb3b099fee004ae9dfcaf2029ce1370d81f3f3931a67dff8ec26997R189
Users should now configure their kubelets to opt-out either by using the feature gate or the configuration option if they don't want to use it.
I haven't tested the perf changes with that patch, but it may be something we should document, as there is no stable kernel release with that patch yet, so most users will not have it and possibly have performance degradations as a result.
I agree we should document the performance impact of seccomp on k/website, I can take care of that during the graduation of the feature.
I am :+1: on enabling by default. This is the performance vs. security tradeoff and we can choose security while giving admins opt out if they prefer performance. (A bunch of CVEs would have been avoided if seccomp were enabled by default.) cc: @derekwaynecarr
I am ok with the default on from the security perspective. But as pointed by @bobbypage at https://github.com/kubernetes/enhancements/pull/3718#issuecomment-1414343045, we should
- Documented the performance impact and how to disable the feature
- Mention the potential performance impact on the release note
Some work that went into runc for perf - https://github.com/opencontainers/runc/pull/3588
From the kernel link:
- runc/crun already set SECCOMP_FILTER_FLAG_SPEC_ALLOW by default, k8s
and podman have a default json seccomp allowlist that cannot be
slowed down, so for the #1 seccomp user this change is already a
noop.
Thanks all for discussion. I performed a bit of adhoc testing:
$ gcloud compute instances create cos-dev --machine-type=n2-standard-4 --image-family="cos-101-lts" --image-project="cos-cloud" --zone=us-central1-c
cos-dev ~ # uname -a
Linux cos-dev2 5.15.65+ #1 SMP Sat Jan 21 10:12:05 UTC 2023 x86_64 Intel(R) Xeon(R) CPU @ 2.80GHz GenuineIntel GNU/Linux
cos-dev ~ # cat /etc/os-release
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
GOOGLE_METRICS_PRODUCT_ID=26
KERNEL_COMMIT_ID=44456f0e9d2cd7a9616fb0d05bc4020237839a5a
GOOGLE_CRASH_ID=Lakitu
VERSION=101
VERSION_ID=101
BUILD_ID=17162.40.56
This is latest COS OS, with latest 5.15 stable kernel. I ran pybench benchmark as recommend in this article. I'm not sure how representative this benchmark is, but it seems like a common python bechmark:
## With seccomp
cos-dev1 ~ # docker run --rm -v `pwd`:/root/results ljishen/pybench bench /root/results/output
Totals: 2773ms 2798ms
## Disable Seccomp
cos-dev1 ~ # docker run --rm --security-opt seccomp=unconfined -v `pwd`:/root/results ljishen/pybench bench /root/results/output
Totals: 1964ms 1981ms
So with seccomp enabled out of the box, the perf impact is pretty significant, ~ 1.5x worse.
Applying the kernel command line changes as was done in the above kernel patch of setting: (spec_store_bypass_disable=prctl spectre_v2_user=prctl
):
## With seccomp, but spectre mitigations disabled
cos-dev1 ~ # docker run --rm -v `pwd`:/root/results ljishen/pybench bench /root/results/output
Totals: 1955ms 1975ms
The results are much better and equivalent to running without seccomp before.
The kernel patch will change defaults to make it the default in newer kernels but it was not back ported to the latest 5.15 kernel yet. The runc patches will also accomplish the same effect of disabling those mitigations, but will require folks to use the newer runc.
I'm also supportive to enable this for security improvements, and it's great to see we have mitigation in place for perf impact, but we need to communicate it clearly to users that they should either:
(1) update to newer kernel with that patch (2) change their kernel command line (3) update to latest runc
... to ensure they will not be impacted by any possible perf regressions.
Thank you for all the inputs, I added a new section to the "Beta to GA Graduation" to contain all the documentation bits.
@mrunalp @bobbypage @derekwaynecarr @dchen1107 can we merge this one now? I don't see any outstanding points left.
/lgtm
@deads2k Are you okay removing the hold?
@deads2k Are you okay removing the hold?
Yes, thank you for the updates.
/hold cancel