
clarify security guidance around user namespaces

Open bcressey opened this issue 2 years ago • 12 comments

As noted in #3318, the statement that "Bottlerocket does not currently support user namespaces" can reasonably be interpreted to mean that user namespaces aren't supported. 😀

We need to rewrite it to state what's specifically not supported - running orchestrated containers in a new user namespace - or else drop the statement altogether.

bcressey avatar Aug 04 '23 20:08 bcressey

Any updates on this? @bcressey

Currently, Sysbox is one way to run more privileged containers without running the pod in privileged mode. But given Sysbox's lack of support for AL2, AL2023, and Bottlerocket, and its highly complex installation process, a more stable solution built on Linux 6.3, Kubernetes 1.30, and newer versions of containerd & runc would be welcome. I hope we can finally get the ability to run rootless Docker and Podman inside Kubernetes for CI/CD purposes.

User namespaces achieved beta status in Kubernetes 1.30.

Since we're approaching EOL for 1.29 in EKS (March 23, 2025), we'd like to see a good solution to the problem raised in #3318 in the not-too-distant future.

somebadcode avatar Nov 15 '24 13:11 somebadcode

@bcressey my security team has expressed a desire to enforce user namespaces (spec.hostUsers: false) on all workloads moving forward, so I'm looking to understand what's necessary from Bottlerocket's point of view.

This newly released article from the CNCF implies that it simply requires Linux kernel 6.3 or greater, but the Bottlerocket docs currently contain conflicting information.

Can I get an update on where this issue stands, and whether the user namespace support described in the article is possible with Bottlerocket?

adrianmace avatar Jul 17 '25 00:07 adrianmace

@adrianmace user namespaces should be working in Bottlerocket's *-k8s-1.33 variants, which include a new enough kernel (6.12) and a kubelet that enables the user namespaces feature gate by default.

User namespaces are still off by default, so you'll need to enable them via user-data:

[settings.kernel.sysctl]
"user.max_user_namespaces" = "16384"

bcressey avatar Jul 23 '25 21:07 bcressey

@bcressey Will this setting be available in Auto Mode?

jicowan avatar Aug 08 '25 15:08 jicowan

Sorry for the thread necromancy, but I wanted to ask about user namespace support from the perspective of applications inside a pod. I'm no selinux guru, but after reading through the policies in bottlerocket, it looks like the only way to allow a pod to create a user namespace is to mark it as privileged. User namespaces can be really useful, especially for tools like rootless Docker, and it'd be nice to have some way of enabling them on the host without setting privileged - which defeats the point. Is this something Bottlerocket would intend to support, or is access to user namespaces from within the pod always going to be a no-go?

Alternatively - if we run the pod as privileged, but within a user namespace (hostUsers: false), are we still getting some guarantees of safety?

alexnovak avatar Sep 12 '25 14:09 alexnovak

I'm no selinux guru, but after reading through the policies in bottlerocket, it looks like the only way to allow a pod to create a user namespace is to mark it as privileged.

Bottlerocket's SELinux policy doesn't restrict creation of user namespaces, though they're still disabled by default (user.max_user_namespaces = 0).

Running the pod with hostUsers: false (to place it in its own user namespace) and CAP_SYS_ADMIN (to give it the ability to mount a subset of filesystems, or to create its own user namespaces via unshare) should work.
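
As a sketch of that combination (the pod name, container name, and image are placeholders, and this is not a vetted configuration):

apiVersion: v1
kind: Pod
metadata:
  name: nested-sandbox             # placeholder name
spec:
  hostUsers: false                 # pod gets its own user namespace
  containers:
  - name: sandbox                  # placeholder name
    image: rootless-builder:latest # placeholder image
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]         # allows mounting a subset of filesystems and nested unshare, per the note above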

However, by default Kubernetes will only delegate 65k UIDs and GIDs to the pod, which would not be enough for applications like podman, buildkit, and docker to subsequently allocate a different range of UIDs and GIDs to processes within that namespace.

It looks like there's a setting for that though: idsPerPod. If Bottlerocket allowed you to configure that, or if it auto-calculated a value based on max pods, then pods launched in a user namespace would have a larger range in their mappings, and could then delegate parts of that range to their own child processes.
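
For context, a KubeletConfiguration sketch: the UserNamespacesSupport feature gate is the one mentioned above (beta in 1.30, on by default in the 1.33 variants), while the commented idsPerPod placement is purely hypothetical and only illustrates the idea:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  UserNamespacesSupport: true      # beta since Kubernetes 1.30; enabled by default in the 1.33 variants
# Hypothetical: the exact location of a per-pod ID range knob such as idsPerPod is an
# assumption here; a larger range would let pods sub-delegate UIDs/GIDs as described above.
# userNamespaces:
#   idsPerPod: 1048576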

If we run the pod as privileged, but within a user namespace (hostUsers: false), are we still getting some guarantees of safety?

Here's my take on most to least secure:

  1. Running a process within a user namespace with a different range of UIDs and GIDs, and without any capabilities.
  2. Running a process without any capabilities in the host namespace.
  3. Running a process within a user namespace with capabilities like CAP_SYS_ADMIN.
  4. Running a process with capabilities like CAP_SYS_ADMIN in the host namespace.

If local privilege escalation is a pressing concern, you will have the fewest bad days per year with "1", a few more with "2", and a really bad year with either "3" or "4".

The best case scenario for "3" is that you use the capabilities to sandbox the untrusted process so it ends up in "1", just with more steps along the way.

If you can't do that because Bottlerocket blocks it, that's a bug that we should fix. But if you don't try to sandbox it, and it's just a privileged process inside a user namespace, then I don't personally see that as a meaningful increase in security.

bcressey avatar Sep 12 '25 22:09 bcressey

Thanks for the thoughts! You're right that it wasn't SELinux. I was on a node that had user.max_user_namespaces set to a nonzero value, but I found that I also had to loosen the seccomp profile, which seemed to restrict the ability to create user namespaces by default. I'm not familiar with how those policies are determined, so I'm unsure whether that's tied to Bottlerocket or the kubelet.

alexnovak avatar Sep 16 '25 18:09 alexnovak

I've created a BR NodePool with the following configuration:

[settings.kernel.sysctl]
"user.max_user_namespaces" = "16384"
"vm.max_map_count" = "262144"

Yet, when I start a pod with hostUsers: false, the pod remains stuck in the ContainerCreating state. The kubelet throws the following warning:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create network namespace for sandbox "7e0db2de686260e4b2b06900ebdb3aeb1b418be6986a1def7f80f5cf2c460475": failed to start noop process for unshare: fork/exec /proc/self/exe: no space left on device

Running the pod with securityContext.privileged=true doesn't make any difference either.

Worker node info:

System Info:
  ...
  Kernel Version:             6.12.46
  OS Image:                   Bottlerocket OS 1.49.0 (aws-k8s-1.34)
  ...
  Container Runtime Version:  containerd://2.1.4+bottlerocket
  Kubelet Version:            v1.34.0-eks-642f211

Am I doing something wrong?

realvz avatar Oct 14 '25 01:10 realvz

That looks correct @realvz

I ran into the userns issue last week when setting up rootless buildkit. I only needed to add settings.kernel.sysctl.'user.max_user_namespaces' = "16384", the same as you have.

It did require disabling seccomp & apparmor; maybe that will help?

andrew-aiken avatar Oct 14 '25 01:10 andrew-aiken

I disabled seccomp (and apparmor for good measure) but still couldn't get past the error.

Here's my manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-namespace-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: user-namespace-test
  template:
    metadata:
      labels:
        app: user-namespace-test
    spec:
      hostUsers: false
      containers:
      - name: test-container
        image: nginx:latest
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo 'User ID:' $(id -u) 'Group ID:' $(id -g); sleep 30; done"]
        securityContext:
          runAsUser: 1000
          runAsGroup: 1000
          seccompProfile:
            type: Unconfined
          appArmorProfile:
            type: Unconfined

realvz avatar Oct 14 '25 04:10 realvz

Hmm, I just ran your snippet and it works for me. The only difference I can think of is that I'm still running 1.33 (containerd://2.0.6+bottlerocket & v1.33.4-eks-e386d34).

Here's what I have for Bottlerocket kernel settings:

apiclient get settings.kernel
{
  "settings": {
    "kernel": {
      "sysctl": {
        "user.max_user_namespaces": "16384"
      }
    }
  }
}

andrew-aiken avatar Oct 14 '25 13:10 andrew-aiken

Thanks @andrew-aiken! I successfully ran pods with hostUsers: false on BR 1.33 nodes. After upgrading both the cluster and node group to 1.34, it continues working as expected. It's puzzling why it didn't work earlier with the exact same setup.

Additionally, I didn't have to loosen the seccomp policy. Here's my pod:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-namespace-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: user-namespace-test
  template:
    metadata:
      labels:
        app: user-namespace-test
    spec:
      hostUsers: false
      containers:
      - name: test-container
        image: nginx:latest
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo 'User ID:' $(id -u) 'Group ID:' $(id -g); sleep 30; done"]
        securityContext:
          runAsUser: 1000
          runAsGroup: 1000
realvz avatar Oct 14 '25 19:10 realvz