runsc inside of default docker seccomp policy
@scanlime on Twitter is trying to run runsc inside a Docker container with the standard seccomp policy enabled. This is similar to rootless mode (#311), but a little bit more strict.
The immediate issue is that we exec into empty namespaces, which the profile does not allow. It is not clear if there would be more issues if that were resolved, though I didn't see any glaring issues comparing our seccomp filters to Docker's.
It's also not clear if the defense-in-depth features we'd have to disable to make this work would make it a bad idea. But in general, it is very reasonable to want to run a sandbox as a subprocess in an existing container.
cc @fvoznika @nlacasse
Hi, prattmic invited me here to explain a bit further. Thanks for opening the issue. It's still in the experimental stages, but I'm working on a media server that will be doing transcodes in a locked-down environment, and I've been investigating which approaches I could adopt without adding a portability burden. For wide use I would really want the resulting project to run in a Docker container, and adding extra privileges to that container for the promise of better sandboxing within the container is really a non-starter. So that has me looking at either forking gVisor and collapsing it to the bare minimum, or writing a tiny seccomp and LD_PRELOAD thing just for this purpose.
That's an interesting use case. gVisor sets up many layers of defense around the sandbox to reduce access to the host in case the sandbox is compromised. You can find more details in the "Containing a Real Vulnerability" blog post. Configuring some of these layers requires CAP_SYS_ADMIN, which is not available to Docker containers unless the docker run --privileged flag is used. This is unintuitive, but in the same way that Docker needs to run as root to create containers, gVisor needs root to protect the sandbox as well.
So you can remove some of the layers that require high privilege (e.g. namespaces, pivot_root), similar to the way --rootless does. You just need to be aware that in the end the sandbox will not be as protected. You need to decide whether this protection is good enough for your use case. On the positive side, the most important security layer is seccomp, which you can still use.
Another consideration is that the gofer requires these capabilities to function correctly. Otherwise, some operations will not be allowed, like creating files with different owners.
A few other options to consider are:
- Run everything inside gVisor: this works if you don't need to isolate the media server from the code that would be running inside gVisor.
- Use different containers: run the media server in one container and the untrusted workload in runsc. You can connect them using a container network. This would require giving the media server access to the Docker unix socket to run containers.
https://github.com/avagin/gvisor/commit/1924e9e258e0ee687db120ab3d70f36d93209f12
This is the POC patch that adds the --unprivileged flag. With this flag, gVisor can be executed in a default Docker container, with the limitations that Fabricio described in the previous comment.
$ docker run -it --rm -v /tmp/runsc:/mnt alpine /bin/sh
/ #
/ # /mnt/runsc --rootless --unprivileged --network none do echo 'Hello World!'
Hello World!
If the --unprivileged flag is specified, the Sentry and Gofer processes run in the current set of namespaces. In other words, we remove one level of isolation. But in your case, the Docker container provides this extra level, so I think your use case can still be valid.
This is interesting! How do filesystem and PID isolation work in this case? It looks from my untrained reading that this would allow access to send signals to other processes owned by the same user, or open files accessible to the current user. Does the emulated kernel provide that level of isolation?
Yes, the userspace kernel still provides isolation, because it is emulating the OS based on the provided configuration. These capabilities are providing defense-in-depth in the event that the kernel is compromised.
In the case of signals, all thread IDs within the sandbox are entirely internal to the userspace kernel, with no relation to the host. Signal syscalls sent by the sandboxed application can only target other threads in the sandbox (implementing a signal syscall in the userspace kernel may not even send a host signal at all). In fact, using a PID namespace is really a third layer of defense, since the userspace kernel can't send signals to other processes anyways.
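To make the signal point concrete, here is a toy model of a userspace kernel that keeps its own thread-ID table. It is only a sketch of the idea described above and is not gVisor's implementation; the ToyKernel class and its methods are hypothetical names invented for illustration.

```python
import errno


class ToyKernel:
    """Hypothetical userspace kernel with sandbox-internal thread IDs."""

    def __init__(self):
        self._next_tid = 1
        # Internal TID -> list of pending signal numbers. Host PIDs never
        # appear in this table, so they cannot be named by the sandbox.
        self._pending = {}

    def spawn(self) -> int:
        """Create a sandboxed thread and return its internal TID."""
        tid = self._next_tid
        self._next_tid += 1
        self._pending[tid] = []
        return tid

    def kill(self, tid: int, signo: int) -> int:
        """Emulate kill(2): 0 on success, -ESRCH for an unknown TID.

        The signal is only queued in the emulated kernel's own state;
        no host signal syscall is made here.
        """
        if tid not in self._pending:
            return -errno.ESRCH
        self._pending[tid].append(signo)
        return 0

    def pending(self, tid: int) -> list:
        """Return a copy of the signals queued for an internal TID."""
        return list(self._pending[tid])
```

In this model a sandboxed application can only signal threads the emulated kernel created, because those are the only IDs in the table. A host PID namespace would then only matter if the emulated kernel itself were compromised.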
Similarly, the userspace kernel can't directly open host files. That is mediated by another process called the gofer. The gofer won't grant access to files not allowed by the configuration, but were it to be compromised, the mount namespace containing only the configured files provides an additional layer of protection.
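The file mediation described above can be sketched the same way. This is a deliberately simplified, hypothetical model of a file proxy that only serves paths under configured roots; ToyGofer and is_allowed are invented names, and the real gofer works differently (this sketch does not resolve symlinks, for instance).

```python
import posixpath


class ToyGofer:
    """Hypothetical file proxy that only serves configured directories."""

    def __init__(self, allowed_roots):
        # Normalize the configured roots once, e.g. "/srv/media/" -> "/srv/media".
        self._roots = [posixpath.normpath(r) for r in allowed_roots]

    def is_allowed(self, path: str) -> bool:
        """Return True if the path, after collapsing '..', is under a root.

        Assumption: lexical normalization only. A real proxy would also
        have to deal with symlinks and races, which this sketch ignores.
        """
        normalized = posixpath.normpath(posixpath.join("/", path))
        for root in self._roots:
            prefix = root.rstrip("/") + "/"
            if normalized == root or normalized.startswith(prefix):
                return True
        return False
```

A check like this is the first layer. The mount namespace mentioned above is the second: even if the check were bypassed, files outside the configured set would not be visible to the proxy process at all.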
This is interesting. I have a slightly different use case: I was also trying to use runsc inside a container, in an attempt to get a dockerised (proprietary) application to work on a hardened K8S platform. The application requires some privileged capabilities to work, but the container platform drops all privileged capabilities for tenants for security reasons.
The idea was to use gVisor in a pod to (in a crude sense) pick these capabilities back up, as a compatibility layer for the app. Would such a thing be feasible?
It's not a generic solution that will work with all containers, but it may work in your case depending on what the container does at runtime. In gVisor, all file system operations are handled by an external file proxy, called the gofer, that is isolated from the sandbox for security purposes. The gofer requires capabilities to function correctly. For example, when the container creates a file while running as a user that exists only inside the sandbox, the gofer requires CAP_CHOWN to change the file owner after creation.
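One way to see which of the capabilities mentioned in this thread a container actually holds is to read the CapEff line of /proc/self/status. The following is a minimal sketch, assuming a Linux procfs; the helper names are made up for illustration, and the bit numbers are taken from linux/capability.h.

```python
# Capability bit numbers from linux/capability.h.
CAP_CHOWN = 0
CAP_SYS_ADMIN = 21


def parse_cap_eff(status_text: str) -> int:
    """Return the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The value is a hexadecimal bitmask without a 0x prefix.
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask: int, cap: int) -> bool:
    """Check whether the given capability bit is set in the mask."""
    return bool((mask >> cap) & 1)


def current_process_caps() -> dict:
    """Report the two capabilities discussed here for this process (Linux only)."""
    with open("/proc/self/status") as f:
        mask = parse_cap_eff(f.read())
    return {
        "CAP_CHOWN": has_cap(mask, CAP_CHOWN),
        "CAP_SYS_ADMIN": has_cap(mask, CAP_SYS_ADMIN),
    }
```

Calling current_process_caps() inside the container shows whether CAP_CHOWN or CAP_SYS_ADMIN is in the effective set, which indicates which of the operations described above are likely to fail there.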
Is there any possibility the --unprivileged flag from @avagin's POC could be added to mainline gVisor? It comes in handy from time to time for the kind of use case outlined at the start of the thread. I could imagine the fear being that people would turn it on without understanding the trade-offs.
For my project I must also run gVisor inside a Docker container, for integration-testing purposes. It is not possible to do this outside of a Docker container: the environment has other requirements, like specific file system mounts and Linux-specific tools, and the tests must be triggered locally from systems that don't have gVisor installed, don't have the file systems mounted, or aren't even running Linux.
Is there any way the unprivileged flag could somehow be merged or updated?
Chiming in with another request to merge in this flag. It's needed for some use cases! Any reason not to merge it in?
We've been maintaining a fork with this flag, and it's a pain to maintain a fork... If the concern is that people may use --unprivileged without understanding tradeoffs, a scarier name would be fine. Perhaps --omit-privileged-isolation?
Could a maintainer please respond if this flag can be merged in (either as --unprivileged or as --omit-privileged-isolation)? @avagin @fvoznika?