lxd security.syscalls.intercept.mknod no longer seems to be working for docker [ nested ? ] use cases

LXD: latest/edge [ ver: 6.2-bde4d03 , rev: 6.2-bde4d03 ] OS: core24

As per the subject:

# lxc config show -e test-container |grep security
  security.nesting: "true"
  security.syscalls.intercept.mknod: "true"
  security.syscalls.intercept.setxattr: "true"
#

# lxc exec test-container /bin/bash
# docker run -it --rm --privileged busybox mknod /root/test c 1 3
mknod: /root/test: Operation not permitted
#

If I enable priv mode, and restart test-container:

# lxc config show -e test-container |grep security
  security.nesting: "true"
  security.privileged: "true"
  security.syscalls.intercept.mknod: "true"
  security.syscalls.intercept.setxattr: "true"
#

# lxc exec test-container /bin/bash
# docker run -it --rm --privileged busybox mknod /root/test c 1 3
#

Outside of docker it seems to work:

# lxc config show -e test-container |grep security
  security.nesting: "true"
  security.syscalls.intercept.mknod: "true"
  security.syscalls.intercept.setxattr: "true"
#

# lxc exec test-container /bin/bash
# mknod /root/test c 1 3
#

This actually came to my attention because it seems that some image layer creation can require mknod now in newer versions of docker. So a simple image pull can fail. It seems to be image dependent, but I'm not sure what the actual trigger is. I don't know of any public images that contain this issue that I can share.

The image problem could be related to Native Overlay Diff: true in overlay2, but that's a complete guess on my part. If I downgrade docker to a version that doesn't enable that with overlay2 [ from 27.2.0 to 24.0.5 ] with the zfs or btrfs backing store, the image related error goes away.
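For anyone comparing docker versions, a minimal sketch for reading that field follows. It assumes `docker info` prints a `Native Overlay Diff:` line under the storage driver section, as it does for overlay2; the function name `native_overlay_diff` is made up for this example.

```shell
# Print the "Native Overlay Diff" value found in `docker info` text read from stdin.
# Prints nothing if the field is absent (for example with a non-overlay driver).
native_overlay_diff() {
    sed -n 's/^[[:space:]]*Native Overlay Diff:[[:space:]]*//p' | head -n 1
}
```

Usage inside the LXD container would be along the lines of `docker info 2>/dev/null | native_overlay_diff`, run once on each docker version being compared.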

Please let me know if you require any further info.

Thanks!

Jan 24 '25 15:01 jocado

@jocado does this work on 5.0/stable or 5.21/stable versions in your environment?

Jan 24 '25 15:01 tomponline

Indeed we have similar tests here:

https://github.com/canonical/lxd-ci/blob/a5e286cd9bebf2781ae8c64e8b3ffd3c3b0a66b6/tests/interception#L35-L47

Jan 24 '25 15:01 tomponline

@jocado does this work on 5.0/stable or 5.21/stable versions in your environment?

Seems to be the same result.

Jan 24 '25 16:01 jocado

Just to be super clear, it works fine outside of docker still.

Jan 24 '25 16:01 jocado

I did also find another reference to the issue here

Can't be sure it's exactly the same, but I suspect it may be.

Jan 24 '25 16:01 jocado

Maybe one for @mihalicyn to look into when he gets a chance.

Jan 24 '25 16:01 tomponline

The main issue for us is the image loading for some images [ as mentioned above, for an unknown reason some image pulls trigger a mknod, although the files referenced are certainly not device files ], but we can mitigate that by sticking with docker 24 for now.

However, we will need to upgrade at some point, and I wouldn't be surprised if another mknod-related requirement presents itself eventually, even if we were able to stick with this version of docker for a while.

Anyway, thanks for taking a look 👍

Jan 24 '25 16:01 jocado

Hi @jocado,

thanks for reporting this. I can confirm that the problem exists and can be reproduced in an unprivileged container with mknod interception turned on, like this:

$ unshare -m
$ mkdir {work,upper,lower,ovl}
$ mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work ovl
$ mknod /root/ovl/null c 1 3
mknod: /root/ovl/null: Operation not permitted

It is definitely not a regression and this never worked before. Overlayfs + mknod interception is problematic, because overlayfs internally uses override_creds() and mknod interception only works in a hacky way, where we do a bind mount instead of a real mknod. Your case is even more problematic, as there is an extra mount namespace. I've an idea how to fix this, will try and get back to you a little bit later.
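Since the failure depends on the target path being on overlayfs, a quick check of the filesystem type can help tell the two cases apart. This is only a sketch: it assumes GNU coreutils `stat` (which reports `overlayfs` for overlay mounts), and the helper names `fs_type_of` and `on_overlayfs` are made up for this example.

```shell
# Print the filesystem type name of the given path (GNU coreutils stat assumed).
fs_type_of() {
    stat -f -c %T "$1"
}

# Succeed if the path lives on overlayfs, the case described above
# where an intercepted mknod is reported to fail.
on_overlayfs() {
    [ "$(fs_type_of "$1")" = "overlayfs" ]
}
```

For example, `on_overlayfs /root/ovl && echo "mknod here will likely be refused"` in the reproducer above, or `on_overlayfs /root` inside the docker container.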

This actually came to my attention because it seems that some image layer creation can require mknod now in newer versions of docker. So a simple image pull can fail. It seems to be image dependent, but I'm not sure what the actual trigger is. I don't know of any public images that contain this issue that I can share.

Actually, that would be awesome to know a bit more about this. I wonder how the docker version can affect the behavior here. Also, even if the problematic images are private, maybe you can give me a hint about what is going on in the Dockerfiles of these images. I mean, which images are they based on? And so on... This can help us to potentially find more workarounds for you and other users too.

Kind regards, Alex

Mar 26 '25 19:03 mihalicyn

Hi @mihalicyn

Thanks very much for taking a look 👍

It is definitely not a regression and this never worked before

Right. That may have been an assumption on my part, based on the linked post I referenced above. But I think the situation is that it used to work in nested containers [ docker inside LXC ] because until recently overlayfs was not used in that scenario when using either btrfs or zfs as a backing store [ docker was using the vfs driver ]. As you point out above, overlayfs + mknod interception is the problem.
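To confirm which case a given docker install falls into, the active storage driver can be classified. A minimal sketch, assuming `docker info --format '{{.Driver}}'` reports the driver name; the helper name `uses_overlayfs_driver` is made up, and the list of overlay-backed driver names is an assumption (`overlayfs` is what is reported when the containerd image store is in use).

```shell
# Succeed if the given docker storage driver name is backed by overlayfs.
# Driver names other than the ones listed are treated as not overlay-backed.
uses_overlayfs_driver() {
    case "$1" in
        overlay|overlay2|overlayfs) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage: `uses_overlayfs_driver "$(docker info --format '{{.Driver}}')" && echo "overlayfs-backed"`.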

I've an idea how to fix this, will try and get back to you a little bit later.

That sounds great - very much appreciated!

Actually, that would be awesome to know a bit more about this

Let me see if I can dig out more info on that which is shareable. I was also pretty curious about it when I discovered the issue. Leave it with me, and I'll see what I can do.

Thanks!

Mar 26 '25 19:03 jocado

Hi @mihalicyn

As promised, although a bit delayed, I'm back with a bit more info regarding the image load that produced the mknod error.

Although I still can't point you to any public image, I can provide a very simple Dockerfile that can create a container image which reproduces it.

However, it's also worth noting that during my testing of this, I found that the more recent versions of the docker snap do not seem to have this problem with mknod during image load. So I don't know how useful this info will be now.

The Dockerfile:

FROM python:3.9@sha256:5c72dd8986db8c289ad1a6514319c144ed8f72c6dd702873c4358445473b6f54
RUN python -m pip --no-cache-dir install setuptools==80.0.0

If you build the image, save that as an archive, and load it on a test system running the affected versions of docker, you should see the error:

DOCKER_BUILDKIT=1 docker build -t foo . --no-cache
docker save foo:latest | gzip -c > foo.tar.gz
scp {somewhere}

# zcat foo.tar.gz | docker load
0238a1790324: Loading layer [==================================================>]  121.3MB/121.3MB
e6e2ab10dba6: Loading layer [==================================================>]   49.6MB/49.6MB
68866beb2ed2: Loading layer [==================================================>]  181.5MB/181.5MB
21e1c4948146: Loading layer [==================================================>]    597MB/597MB
e077e19b6682: Loading layer [==================================================>]  19.25MB/19.25MB
3aff9f9c9f44: Loading layer [==================================================>]  40.25MB/40.25MB
5e7745c5bee2: Loading layer [==================================================>]   5.12kB/5.12kB
17b02461857a: Loading layer [==================================================>]   10.5MB/10.5MB
e4f69e688303: Loading layer [==================================================>]  15.05MB/15.05MB
failed to mknod("/usr/local/lib/python3.9/site-packages/setuptools-58.1.0.dist-info", S_IFCHR, 0): operation not permitted

I found that these version/revisions of the docker snap have this issue:

27.2.0 - 2963
27.5.1 - 3064

The current version in the beta channel seems to load the image fine:

28.1.1 - 3221

# zcat foo.tar.gz | docker load
0238a1790324: Loading layer [==================================================>]  121.3MB/121.3MB
e6e2ab10dba6: Loading layer [==================================================>]   49.6MB/49.6MB
68866beb2ed2: Loading layer [==================================================>]  181.5MB/181.5MB
21e1c4948146: Loading layer [==================================================>]    597MB/597MB
e077e19b6682: Loading layer [==================================================>]  19.25MB/19.25MB
3aff9f9c9f44: Loading layer [==================================================>]  40.25MB/40.25MB
5e7745c5bee2: Loading layer [==================================================>]   5.12kB/5.12kB
17b02461857a: Loading layer [==================================================>]   10.5MB/10.5MB
e4f69e688303: Loading layer [==================================================>]  15.05MB/15.05MB
Loaded image: foo:latest

I also double checked the general mknod issue with 28.1.1, but as expected it still does not work:

# docker run -it --rm --privileged busybox mknod /root/test c 1 3
mknod: /root/test: Operation not permitted

Hope that's at least mildly useful/interesting :)

Did you make any headway testing your potential fix ?

Apr 30 '25 18:04 jocado

@mihalicyn any updates on this one?

May 13 '25 05:05 tomponline

Hey @jocado,

thanks for providing this information. It is invaluable!

failed to mknod("/usr/local/lib/python3.9/site-packages/setuptools-58.1.0.dist-info", S_IFCHR, 0): operation not permitted

This is extremely weird, because it fails to create a whiteout device. Whiteout devices are available inside the container. Please, can you try to check if it works without security.syscalls.intercept.mknod being set to true. You can just do lxc config unset myct security.syscalls.intercept.mknod and check if this image starts to work properly. (btw, I think I've just found a bug in our interception thanks to your info...)
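For context, an overlayfs whiteout is a character device with device number 0/0, which matches the `S_IFCHR, 0` arguments in the error above. A minimal sketch for spotting one, assuming GNU coreutils `stat`; the helper name `is_whiteout` is made up for this example.

```shell
# Succeed if the path is an overlayfs-style whiteout: a character device 0/0.
# %F = file type, %t/%T = major/minor device number in hex (GNU stat).
is_whiteout() {
    [ "$(stat -c '%F %t:%T' "$1" 2>/dev/null)" = "character special file 0:0" ]
}
```

A manual test of whiteout creation inside the container would be `mknod /root/wh c 0 0 && is_whiteout /root/wh`, with `/root/wh` being an arbitrary scratch path.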

May 20 '25 14:05 mihalicyn

Hey. No problem @mihalicyn . Glad it's proved useful 👍

Please, can you try to check if it works without security.syscalls.intercept.mknod being set to true

Interestingly, that does work, which seems to back up your suggestion that it's a bug in the interception for whiteout devices.

# zcat /var/snap/osp-orb/current/foo.tar.gz | docker load
0238a1790324: Loading layer [==================================================>]  121.3MB/121.3MB
e6e2ab10dba6: Loading layer [==================================================>]   49.6MB/49.6MB
68866beb2ed2: Loading layer [==================================================>]  181.5MB/181.5MB
21e1c4948146: Loading layer [==================================================>]    597MB/597MB
e077e19b6682: Loading layer [==================================================>]  19.25MB/19.25MB
3aff9f9c9f44: Loading layer [==================================================>]  40.25MB/40.25MB
5e7745c5bee2: Loading layer [==================================================>]   5.12kB/5.12kB
17b02461857a: Loading layer [==================================================>]   10.5MB/10.5MB
31ee71fe1ab5: Loading layer [==================================================>]  15.05MB/15.05MB
Loaded image: foo:latest

I'm sure it's obvious, but just to make it clear to anyone else reading, the main issue reported here doesn't work, with or without security.syscalls.intercept.mknod being set to true

May 20 '25 16:05 jocado

Interestingly, that does work, which seems to back up your suggestion that it's a bug in the interception for whiteout devices.

yep, fix is here https://github.com/canonical/lxd/pull/15634

I'm sure it's obvious, but just to make it clear to anyone else reading, the main issue reported here doesn't work, with or without security.syscalls.intercept.mknod being set to true

I'm pretty sure that for 99% of docker workloads inside the container you don't need security.syscalls.intercept.setxattr or security.syscalls.intercept.mknod to be enabled, for two simple reasons:

- Modern application container runtimes detect the presence of a user namespace and have a fallback mechanism for mknod (see, for example, https://github.com/opencontainers/runc/blob/8c72dfae65c2a03b98646c55aaa49cd5fea21ab9/libcontainer/rootfs_linux.go#L900).
- Modern application container runtimes and overlayfs support the userxattr mount option, which removes the need for high privileges to operate on xattrs in many standard cases.
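The kind of fallback described in the first point can be sketched in shell. This is an illustration only, not runc's actual code: it assumes the initial user namespace is recognisable by a single `0 0 4294967295` line in /proc/self/uid_map (as documented in user_namespaces(7)), and the helper names `in_user_namespace` and `provide_device` are made up for this example.

```shell
# Succeed if the uid map (default: /proc/self/uid_map) differs from the
# identity mapping that the initial user namespace shows.
in_user_namespace() {
    map_file="${1:-/proc/self/uid_map}"
    [ -r "$map_file" ] || return 1
    # Collapse runs of whitespace and trim each line before comparing.
    mapping="$(tr -s ' \t' ' ' < "$map_file" | sed 's/^ //; s/ $//')"
    [ "$mapping" != "0 0 4294967295" ]
}

# Hypothetical helper: make the char device at $1 available at path $2.
# Inside a user namespace, bind-mount the existing node over an empty file
# instead of calling mknod; otherwise create the node with major $3, minor $4.
provide_device() {
    src="$1"; dest="$2"; major="$3"; minor="$4"
    if in_user_namespace; then
        touch "$dest" && mount --bind "$src" "$dest"
    else
        mknod "$dest" c "$major" "$minor"
    fi
}
```

For example, `provide_device /dev/null /some/rootfs/dev/null 1 3`, where the destination path is a placeholder. This covers devices a runtime sets up itself, not mknod calls made by the workload.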

Now let's get back to what you've reported initially. It was this:

# lxc config show -e test-container |grep security
  security.nesting: "true"
  security.syscalls.intercept.mknod: "true"
  security.syscalls.intercept.setxattr: "true"
#

# lxc exec test-container /bin/bash
# docker run -it --rm --privileged busybox mknod /root/test c 1 3
mknod: /root/test: Operation not permitted
#

And yes, it doesn't work. Unfortunately, we can't really fix that on overlayfs, for the reasons I described in https://github.com/canonical/lxd/issues/14849#issuecomment-2755488159. I had an idea for a potential fix, but it was wrong because it can't be made to work reliably: it can help to create a device node once, but after a container restart it will be a problem.

Is it a real thing that you need to execute mknod from inside the container? Or initially, it was just about this image unpacking issue? (And I believe that the image unpacking issue shouldn't bother you anymore, as we fixed this.)

May 21 '25 11:05 mihalicyn

Hey @mihalicyn - thanks for that info, and great to hear there is a fix for the issue with whiteout devices you mentioned earlier 👍

Is it a real thing that you need to execute mknod from inside the container? Or initially, it was just about this image unpacking issue?

The image unpacking failure was what made me dig into it for sure. If we're saying that an actual real mknod is simply not viable inside a docker container, and that runtimes should already handle that using a fallback method, then I think we are good.

I suppose I'm still a little unsure/confused about why the mknod test fails if the runtime should have some kind of fallback. In my test cases I'm using the nvidia runtime, but it's supposedly just a thin wrapper around runc.

Certainly, if I run the mknod test outside of LXC, just on the host, it works fine:

# docker run -it --rm --privileged busybox mknod /root/test c 1 3
#

Finally, the standard instructions for running docker inside LXC say to set:

lxc config set demo security.nesting=true security.syscalls.intercept.mknod=true security.syscalls.intercept.setxattr=true

Is that just to enable dockerd to function, and nothing really to do with the containers themselves ?

If we can explain and wrap up the above questions/points, then I think it should be fine to close this.

May 21 '25 12:05 jocado