runc icon indicating copy to clipboard operation
runc copied to clipboard

[Proposal] New DMZ solution to defeat CVE-2019-5736

Open lifubang opened this issue 1 year ago • 2 comments

Because of CVE-2019-5736, we have to take some actions to prevent runc binary in the host could be modified from the container, we had many solutions for these years, for example: bindfd, memfd, otmpfile, os tmpfile, Embedded dmz and overlay. Most of then can work for most of times, but in practice, there were many issues for these solutions. So we should find some new possible solutions.

For Embedded dmz solution, we embed a small binary file to start the real container process, but it may cause some issues(#4518), especially the capability behavior change issue(#4125). So we make this solution as a opt-in solution. Consider we have moved the binary clone code from runc init to runc parent process(#3987), so the memory usage of binary clone will not be included in container's memory cgroup accounting. We can embed a big binary file in runc, and copy it to memfd, then use it to start runc init process.

The whole steps should like this:

  1. Move runc init to contrib/cmd and it could be compiled as a separate binary file, for example name it as runc-dmz;
  2. Embed runc-dmz to runc binary;
  3. Remove the runc-dmz file;
  4. Change the old dmz solution to copy runc-dmz binary data to memfd;
  5. Use this memfd to start runc init, like runc-dmz init.

I think the size of runc-dmz should be smaller than runc, and we can read it to memory paralleled, so it would help to reduce the start time of the container, and will fix the issue #4449.

lifubang avatar Oct 16 '24 09:10 lifubang

I'm not sure that a separated runc init would be much smaller (most of runc's logic lives in runc init after all) and doing this split would make the whole runc binary a fair bit larger. So the overhead of copying the binary would still be fairly large (given that the current ~15MB copy adds ~60% overhead using the figures from #4448.)

The overlayfs approach (which it seems I suggested doing in the commit message when I wrote this protection in 0a8e4117e7f715d5fbeef398405813ce8e88558b :sweat_smile:) has basically no performance or memory overhead (based on my performance testing in #4448, it has <1% overhead over doing no copying at all and that might just be noise) and solves the most common case. It also has the benefit of giving us page cache sharing, which none of the copy-based solutions can.

Conceptually, having runc init be separate sounds nice but I don't think the extra binary size overhead is worth it. 15MB for a program as small as runc is already insane, let alone the ~25MB we would get if we had runc init be a separate embedded binary.

(Also O_TMPFILE and tempfiles are the same solution. It's just that O_TMPFILE is the nicer way of doing it on newer kernels.)

cyphar avatar Oct 16 '24 11:10 cyphar

I played with the separate runc init a long time ago; I don't remember any details now except it was not worth it.

Overlay approach indeed looks very promising since most of the systems where runc is used have overlayfs. The only catch is it requires a relatively new kernel v5.1, released in May 2019, meaning some very old distros might not have the syscalls needed (fsopen/fsmount); but even Ubuntu 18.04 has kernel 5.15 (in HWE), and RHEL8 kernel has needed syscalls backported.

kolyshkin avatar Oct 18 '24 01:10 kolyshkin

IMHO now that we have the overlay solution, maybe we should consider dropping runc-dmz since it has too many limitations, complicates the binary cloning logic, and complicates our build process a fair bit (we might even be able to drop cc_platform.mk). Even if we were to go for the "separate runc init binary" solution, the code we currently have would not be needed anyway.

cyphar avatar Oct 25 '24 07:10 cyphar

Agree!

lifubang avatar Oct 25 '24 08:10 lifubang

IMHO now that we have the overlay solution, maybe we should consider dropping runc-dmz since it has too many limitations, complicates the binary cloning logic, and complicates our build process a fair bit (we might even be able to drop cc_platform.mk). Even if we were to go for the "separate runc init binary" solution, the code we currently have would not be needed anyway.

@cyphar not really related, but I'm also thinking about dropping cmd/contrib/memfd-bind. While not as complicated and limited as dmz, it is still complicated (to setup and maintain wrt upgrades) and limited (non-root users are out). OTOH it's harmless (but kind of useless now).

kolyshkin avatar Oct 29 '24 23:10 kolyshkin

I'm of two minds about that. Using it completely removes any potential overhead from /proc/self/exe cloning, but on the other hand the overlayfs stuff is so low overhead that it seems unlikely someone is going to care enough to use it. The fact that it also makes the binary accessible only by root and requires a daemon really make it unlikely that someone is going to use it in anger...

It really sucks that we can't bind-mount the memfd itself (even open_tree(OPEN_TREE_CLONE) doesn't work). If we could then it would make more sense to keep it. AFAIK there also is no (trivial and non-destructive) way of knowing if a file is in the lowerdir of an overlayfs -- if we could figure that out then we could adjust memfd-bind to create an overlayfs file instead, which would solve the problem for all users (even rootless containers would be able to run without overhead).

cyphar avatar Oct 31 '24 03:10 cyphar

AFAIK there also is no (trivial and non-destructive) way of knowing if a file is in the lowerdir of an overlayfs -- if we could figure that out then we could adjust memfd-bind to create an overlayfs file instead, which would solve the problem for all users (even rootless containers would be able to run without overhead).

OK, this is a valid argument for keeping memfd-bind -- while not very useful now, it could be improved to become more useful. I guess we can keep it for now and either improve or remove in time for v1.3 release.

kolyshkin avatar Dec 05 '24 01:12 kolyshkin