clarify security guidance around user namespaces
As noted in #3318, the statement that "Bottlerocket does not currently support user namespaces" can reasonably be interpreted to mean that user namespaces aren't supported. 😀
Need to rewrite it to mention what's specifically not supported - running orchestrated containers in a new user namespace - or else just drop it altogether.
Any updates on this? @bcressey
Currently Sysbox is one way to run more privileged containers without having to run a pod in privileged mode. But given Sysbox's lack of support for AL2, AL2023, and Bottlerocket, and its highly complex installation process, a more stable solution that relies on Linux 6.3, Kubernetes 1.30, and newer versions of containerd & runc would be nice. I hope that we can finally get the ability to run rootless Docker and Podman inside Kubernetes for CI/CD purposes.
User namespaces achieved beta status in Kubernetes 1.30.
Since we're approaching EOL for 1.29 in EKS (March 23, 2025), we'd like to see a good solution to the problem raised in #3318 in the not-too-distant future.
@bcressey my security team has expressed a desire to enforce user namespaces (spec.hostUsers: false) on all workloads moving forward, so I'm looking to understand what's necessary from Bottlerocket point of view.
This newly released article from CNCF implies that it simply requires Linux kernel 6.3 or greater, but the Bottlerocket docs currently contain conflicting information.
Can I get an update on where this issue is at, and whether the user namespace support described in the article is possible with Bottlerocket?
@adrianmace user namespaces should be working in Bottlerocket's *-k8s-1.33 variants, which include a new enough kernel (6.12) and a kubelet that enables the user namespaces feature gate by default.
User namespaces are still off by default so you'll need to enable them via user-data:
[settings.kernel.sysctl]
"user.max_user_namespaces" = "16384"
@bcressey Will this setting be available in Auto Mode?
Sorry for the thread necromancy - but wanted to ask about user namespace support from the perspective of applications inside a pod. I'm no selinux guru, but after reading through the policies in bottlerocket, it looks like the only way to allow a pod to create a user namespace is to mark it as privileged. User namespaces can be really useful, especially from the perspective of some tools like rootless docker, and it'd be nice to have some way of enabling them on the host, without setting privileged - which defeats the point.
Is this something that bottlerocket would intend to support, or is access to user namespaces from within the pod always going to be a no-go?
Alternatively - if we run the pod as privileged, but within a user namespace (hostUsers: false), are we still getting some guarantees of safety?
I'm no selinux guru, but after reading through the policies in bottlerocket, it looks like the only way to allow a pod to create a user namespace is to mark it as privileged.
Bottlerocket's SELinux policy doesn't restrict creation of user namespaces, though they're still disabled by default (user.max_user_namespaces = 0).
Running the pod with hostUsers: false (to place it in its own user namespace) and CAP_SYS_ADMIN (to give it the ability to mount a subset of filesystems, or to create its own user namespaces via unshare) should work.
However, by default Kubernetes will only delegate 65k UIDs and GIDs to the pod, which would not be enough for applications like podman, buildkit, and docker to subsequently allocate a different range of UIDs and GIDs to processes within that namespace.
It looks like there's a setting for that though: idsPerPod. If Bottlerocket allowed you to configure that, or if it auto-calculated a value based on max pods, then pods launched in a user namespace would have a larger range in their mappings, and could then delegate parts of that range to their own child processes.
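For reference, a hedged sketch of what exposing that at the kubelet level might look like - the userNamespaces / idsPerPod field names follow the upstream KEP, but the exact field path, the default, and the Kubernetes version where it becomes configurable are assumptions here, and Bottlerocket doesn't surface this setting today:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Assumed field placement per the upstream user namespaces KEP; verify against
# your Kubernetes version before relying on it.
userNamespaces:
  idsPerPod: 1048576   # illustrative value; the default delegation is 65536 IDs per pod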
If we run the pod as privileged, but within a user namespace (hostUsers: false), are we still getting some guarantees of safety?
Here's my take on most to least secure:
1. Running a process within a user namespace with a different range of UIDs and GIDs, and without any capabilities.
2. Running a process without any capabilities in the host namespace.
3. Running a process within a user namespace with capabilities like CAP_SYS_ADMIN.
4. Running a process with capabilities like CAP_SYS_ADMIN in the host namespace.
If local privilege escalation is a pressing concern, you will have the fewest bad days per year with "1", a few more with "2", and a really bad year with either "3" or "4".
The best case scenario for "3" is that you use the capabilities to sandbox the untrusted process so it ends up in "1", just with more steps along the way.
If you can't do that because Bottlerocket blocks it, that's a bug that we should fix. But if you don't try to sandbox it, and it's just a privileged process inside a user namespace, then I don't personally see that as a meaningful increase in security.
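To make option 1 above concrete, here's a rough sketch of the corresponding pod-level settings (the name and image are placeholders, not a recommendation for any particular workload):
apiVersion: v1
kind: Pod
metadata:
  name: userns-no-caps              # placeholder name
spec:
  hostUsers: false                  # own user namespace, so the pod gets a distinct host UID/GID range
  containers:
    - name: app
      image: busybox:latest         # placeholder image
      command: ["sleep", "3600"]
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]             # no capabilities, even inside the user namespace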
Thanks for the thoughts! You're right that it wasn't SELinux. I was on a node that had user.max_user_namespaces set to a nonzero value, but I found that I also had to loosen the seccomp profile, which seemed to restrict the ability to create user namespaces by default. I'm not familiar with how those policies are determined, so I'm unsure whether that's tied to Bottlerocket or the kubelet.
I've created a BR NodePool with the following configuration:
[settings.kernel.sysctl]
"user.max_user_namespaces" = "16384"
"vm.max_map_count" = "262144"
Yet, when I start a pod with hostUsers: false, the pod remains stuck in the ContainerCreating state. The kubelet throws the following warning:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create network namespace for sandbox "7e0db2de686260e4b2b06900ebdb3aeb1b418be6986a1def7f80f5cf2c460475": failed to start noop process for unshare: fork/exec /proc/self/exe: no space left on device
Running the pod with securityContext.privileged=true doesn't make any difference either.
Worker node info:
System Info:
...
Kernel Version: 6.12.46
OS Image: Bottlerocket OS 1.49.0 (aws-k8s-1.34)
...
Container Runtime Version: containerd://2.1.4+bottlerocket
Kubelet Version: v1.34.0-eks-642f211
Am I doing something wrong?
That looks correct @realvz
I ran into the userns issue last week when setting up rootless buildkit. I only needed to add settings.kernel.sysctl.'user.max_user_namespaces' = "16384", the same as you have.
It did require disabling seccomp & AppArmor, though - maybe that will help?
I disabled seccomp (and apparmor for good measure) but still couldn't get past the error.
Here's my manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-namespace-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: user-namespace-test
  template:
    metadata:
      labels:
        app: user-namespace-test
    spec:
      hostUsers: false
      containers:
        - name: test-container
          image: nginx:latest
          command: ["/bin/sh"]
          args: ["-c", "while true; do echo 'User ID:' $(id -u) 'Group ID:' $(id -g); sleep 30; done"]
          securityContext:
            runAsUser: 1000
            runAsGroup: 1000
            seccompProfile:
              type: Unconfined
            appArmorProfile:
              type: Unconfined
Hmm just ran your snippet and it works for me. The only difference I can think of is that I'm still running 1.33. (containerd://2.0.6+bottlerocket & v1.33.4-eks-e386d34)
What I have for Bottlerocket kernel settings:
apiclient get settings.kernel
{
  "settings": {
    "kernel": {
      "sysctl": {
        "user.max_user_namespaces": "16384"
      }
    }
  }
}
Thanks @andrew-aiken! I successfully ran pods with hostUsers: false on BR 1.33 nodes. After upgrading both the cluster and node group to 1.34, it continues working as expected. Puzzling why it didn't work earlier with the exact same setup.
Additionally, I didn't have to loosen the seccomp policy. Here's my pod:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-namespace-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: user-namespace-test
  template:
    metadata:
      labels:
        app: user-namespace-test
    spec:
      hostUsers: false
      containers:
        - name: test-container
          image: nginx:latest
          command: ["/bin/sh"]
          args: ["-c", "while true; do echo 'User ID:' $(id -u) 'Group ID:' $(id -g); sleep 30; done"]
          securityContext:
            runAsUser: 1000
            runAsGroup: 1000