Drop in-container /proc requirement?
Spun off from #1726, it would be nice to not require /proc inside containers. Current blockers:
- [ ] capabilities loading, but with syndtr/gocapability#14 landed I'll be able to fix that (filed as #1735).
- [ ] closing extra file descriptors, but we can handle that before leaving the host mount namespace.
- [ ] opening the start-signal FIFO, but we can inherit a socket through from outside without reopening.
Do we need /proc for anything else? @cyphar seems to imply we do (or I'm reading him wrong). This is definitely an edge case, but I'd like to get it working.
So, the short version of my opinion on this is that removing the /proc requirement from inside the container would be quite cool (and it also would allow us to beef up security around malicious pid1 containers -- something that all container runtimes are lacking these days).
My comment in #1726 was a bit of a jumbled mess. The thrust of my point is that there are several theoretical security issues in runc at the moment related to /proc -- and while we can fix some of these issues, in order to fix all of them we need some extra kernel primitives we don't have at the moment.
opening the start-signal FIFO, but we can inherit a socket through from outside without reopening.
Can you clarify what you mean here? The O_PATH and re-opening trick is necessary to protect against container escapes -- we could send it via a unix socket, but that feels more complicated. My goal would be to get openat to accept AT_EMPTY_PATH -- but this is something that I'd need to get Al Viro and Eric Biederman to accept.
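For readers unfamiliar with the O_PATH and re-opening trick, here is a rough sketch of the general pattern. It is not runc's actual code, and the helper names `reopen_via_proc` and `open_fifo_checked` are made up for illustration. It also shows why the trick depends on `/proc` being mounted wherever the re-open happens.

```python
import os
import stat


def reopen_via_proc(path_fd, flags):
    """Re-open an O_PATH descriptor through /proc/self/fd (requires /proc)."""
    return os.open(f"/proc/self/fd/{path_fd}", flags | os.O_CLOEXEC)


def open_fifo_checked(path, flags):
    """Pin the inode with O_PATH, verify it is a FIFO, then re-open that inode."""
    # O_PATH | O_NOFOLLOW pins whatever the final path component is (even a
    # symlink) without opening it for I/O, so this call cannot block on a FIFO.
    path_fd = os.open(path, os.O_PATH | os.O_NOFOLLOW | os.O_CLOEXEC)
    try:
        if not stat.S_ISFIFO(os.fstat(path_fd).st_mode):
            raise OSError(f"{path!r} is not a FIFO")
        # The re-open acts on the pinned inode, not on the path, so swapping
        # the path for something else after the check has no effect.
        return reopen_via_proc(path_fd, flags)
    finally:
        os.close(path_fd)
```

With an `openat` that accepted `AT_EMPTY_PATH`, the `/proc/self/fd` step could in principle go away, but that interface is only a wish in this thread.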
closing extra file descriptors, but we can handle that before leaving the host mount namespace.
This can be done without needing /proc at all -- you can do what rpm does, where you do a fcntl(fd, F_GETFD). The downside though is that you don't know which FDs are actually mapped, so it's quite expensive to check all of them (it would be nice if there was a syscall API for this -- but I wouldn't bet on this being a popular request).
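A minimal sketch of that probing approach, assuming a brute-force scan up to the soft `RLIMIT_NOFILE` (the 65536 cap and the function names are arbitrary choices for illustration, not taken from rpm or runc):

```python
import errno
import fcntl
import os
import resource


def open_fds(start=3, limit=None):
    """Return open descriptors >= start, found by probing every number."""
    if limit is None:
        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        # An unlimited soft limit would make the scan unbounded, so cap it.
        limit = 65536 if soft == resource.RLIM_INFINITY else min(soft, 65536)
    found = []
    for fd in range(start, limit):
        try:
            fcntl.fcntl(fd, fcntl.F_GETFD)  # fails with EBADF if fd is unused
        except OSError as exc:
            if exc.errno != errno.EBADF:
                raise
            continue
        found.append(fd)
    return found


def mark_cloexec(fds):
    """Set FD_CLOEXEC on each descriptor so it is closed on exec."""
    for fd in fds:
        flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
```

The cost is one syscall per candidate number whether or not it is open, which is the expense mentioned above. A descriptor opened above the current soft limit (for example, before the limit was lowered) would also be missed by this scan.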
@cyphar seems to imply we do (or I'm reading him wrong).
At the moment I'd rather not detail the exact security risks we have with /proc in public -- because fixing them requires some upstream kernel work that has been quite slow-moving. However, the point of my comment was not that "we need /proc inside the container" -- it's that there are several security problems that we need to resolve with /proc in the future.
This is definitely an edge case, but I'd like to get it working.
I think that the key benefit of removing the /proc requirement is not so that people can run containers without /proc mounted (such containers are pretty much useless to be honest). The benefit would be the added security you get from the runtime not having to trust the filesystem in the container (this is one of the reasons I added TIOCGPTPEER in Linux 4.13).
On Thu, Feb 22, 2018 at 11:47:20PM -0800, Aleksa Sarai wrote:
opening the start-signal FIFO, but we can inherit a socket through from outside without reopening.
Can you clarify what you mean here? The O_PATH and re-opening trick is necessary to protect against container escapes -- we could send it via a unix socket, but that feels more complicated.
I'll work up a PR and we'll see how complicated it ends up being.
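As a point of reference for the unix-socket alternative, passing an already-open descriptor over a socket looks roughly like this. This is a sketch using `SCM_RIGHTS` through Python 3.9+'s `socket.send_fds` and `socket.recv_fds`; the helper names and the one-byte marker are hypothetical, and it says nothing about how the eventual PR would be structured.

```python
import os
import socket


def send_fd(sock, fd):
    """Send one descriptor over a unix socket as SCM_RIGHTS ancillary data."""
    # At least one byte of ordinary data must accompany the ancillary data.
    socket.send_fds(sock, [b"F"], [fd])


def recv_fd(sock):
    """Receive one descriptor; the kernel installs it as a new fd number."""
    msg, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    if msg != b"F" or len(fds) != 1:
        raise RuntimeError("expected exactly one descriptor")
    return fds[0]
```

The receiver gets a descriptor referring to the same open file description as the sender's, so no path lookup and no `/proc` access happens on the receiving side.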
closing extra file descriptors, but we can handle that before leaving the host mount namespace.
This can be done without needing /proc at all -- you can do what rpm does, where you do a fcntl(fd, F_GETFD). The downside though is that you don't know which FDs are actually mapped, so it's quite expensive to check all of them (it would be nice if there was a syscall API for this -- but I wouldn't bet on this being a popular request).
But where are these extra file descriptors coming from? If we use the host /proc to close them off before leaving the host mount namespace, and we CLOEXEC everything we open internally and do not intend to pass on, there should be no need to check again inside the container mount namespace. Again, I can work up a PR and we'll see how this works out.
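The host-side cleanup I have in mind would look something like the following sketch. The function name `cloexec_from` is hypothetical, and this assumes it runs while the host /proc is still visible, i.e. before switching mount namespaces.

```python
import os


def cloexec_from(min_fd, proc_fd_dir="/proc/self/fd"):
    """Mark every open descriptor >= min_fd close-on-exec, using /proc."""
    marked = []
    for name in os.listdir(proc_fd_dir):
        try:
            fd = int(name)
        except ValueError:
            continue
        if fd < min_fd:
            continue
        try:
            os.set_inheritable(fd, False)  # sets FD_CLOEXEC
        except OSError:
            # Typically the directory descriptor that listdir itself used,
            # which is already closed by the time we get here.
            continue
        marked.append(fd)
    return sorted(marked)
```

Unlike probing every number with fcntl, this only touches descriptors that are actually open, but it does trust the /proc it reads from, which is why it would have to happen on the host side.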
At the moment I'd rather not detail the exact security risks we have with /proc in public…
That's fine; for the purpose of this issue it's enough to know that your insurmountable issues were with protecting a container-side /proc, not about removing our need for one.
Looking at this today, this seems like something we could not (and probably should not) try to achieve, as /proc is an integral part of modern Linux and many tools would not work without it. Therefore I think we should close this one, wdyt @cyphar?
On a different note, a better-namespaced and more secure /proc for containers, akin to what we used to have in the OpenVZ kernel, would be nice to have from the Linux kernel.
@kolyshkin I always interpreted this issue as being more about the pathrs-lite and libpathrs procfs wrappers which let us not have to trust the /proc in the container.
I would also like a more namespaced /proc. We do finally have subset=pid but that's fairly minor compared to the OpenVZ stuff. Unfortunately, trying to containerise /proc has historically resulted in lots of pushback from upstream.