Sandbox all document processing in gVisor
When running on Linux, Dangerzone currently uses Podman with its default crun/runc runtime. These runtimes rely on Linux's built-in containerization primitives (namespaces), parts of the kernel that have historically been the target of many container escape vulnerabilities, because the host kernel remains fully exposed to the application running inside the container.
In particular, PDF and PostScript libraries such as Ghostscript have been a notorious target for precisely this type of exploit. So while running these tools within Linux containers is better than running them directly on the host, it does not fully shield the host kernel from malicious code running in the container.
This pull request implements optional support for the gVisor container runtime, called runsc ("run sandboxed container"). gVisor is an open-source OCI-compliant container runtime. It is a userspace reimplementation of the Linux kernel in a memory-safe language.
It works by creating a sandboxed environment in which regular Linux applications run, but their system calls are intercepted by gVisor. gVisor reinterprets these system calls using the logic of its own kernel written in Go, and responds to them itself rather than passing them on to the host Linux kernel. This means the host Linux kernel is isolated from the sandboxed application, thereby providing a significant level of protection against Linux container escape attacks.
gVisor further hardens itself by using the typical container primitives (isolating its own view of the host filesystem, running in the various types of namespaces that Linux supports), and also sets a restrictive seccomp-bpf policy that only allows basic system calls through. This way, even if its userspace kernel were compromised, attackers would additionally need a "typical" Linux container escape vector, and that exploit would have to fit within the restricted seccomp-bpf rules that gVisor imposes on itself.
This provides a level of protection comparable to a hardened hypervisor running workloads in a VM. However, gVisor doesn't actually use virtualization, so it is portable to all Linux environments and doesn't require virtualization support. It runs on x86 and ARM.
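For readers who want to try this locally, selecting gVisor as the OCI runtime for a Podman invocation looks roughly like the sketch below. The runsc path, image name, and command are illustrative placeholders, not the exact integration done in this PR:

```sh
# Illustrative only: point Podman at a runsc binary instead of the default
# crun/runc runtime. Paths and image name are placeholders.
podman --runtime /usr/local/bin/runsc run --rm -i \
    dangerzone.rocks/dangerzone \
    /usr/bin/python3 -m dangerzone.conversion.doc_to_pixels
```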
The initial commit of this pull request only adds support for using it inside isolation_provider/container.py. If there is appetite for this runtime, I'm happy to add CircleCI tests to integrate it better. Let me know what you think!
Thanks for the contribution! We are just wrapping up a release and I'll come back to this after we're done with that.
As noted in https://github.com/freedomofpress/dangerzone/pull/589#issuecomment-1759343903, the usage of Docker is problematic. So I have updated this PR to use gVisor with Podman rather than Docker. Let me know if this is more interesting, and I can add tests, QA instructions, and better documentation.
Thanks for pivoting from Docker to Podman. I think we now have a clearer path for adding gVisor support in Dangerzone. So, to answer your question, yes, feel free to polish this PR with tests and instructions.
Personally, gVisor support is something that I have really wanted us to tackle (https://github.com/freedomofpress/dangerzone/issues/126) for a long time. The fact that we can now have this support for Linux, albeit optionally, is great. But my aspiration, and it shouldn't affect this PR btw, is to include gVisor support across all platforms. That is, running runsc within the container itself, so that users on macOS and Windows (which are the main platforms that journalists use) can also be protected.
This effort is currently blocked until we have more input in https://github.com/google/gvisor/issues/8205. But once we tackle #443 and we no longer need mounted directories, we can make progress on this front. Until then, this PR is super welcome, and thanks for spearheading this effort :slightly_smiling_face:
Thanks for the details; great to see there's already been lots of investigatory work on this.
I actually had independently tried to implement it via the "in-container runsc do" mechanism you mention; it would indeed mean better security on Windows/Mac, because it would protect the two containers from each other (otherwise, AIUI they run in the same VM that Docker Desktop manages). I actually got it mostly working with runsc do; the caveat was that instead of using rootless runsc in an unprivileged container, it runs it as root in a privileged container, and wraps the command that runs inside the gVisor sandbox by prefixing it with sudo -u dangerzone --. This sounds scary, but I don't actually think it's much of a problem, given that the security boundary would move to gVisor rather than the container boundary: with such a setup, the role of Docker/Podman would no longer be that of a security boundary; instead it would just fulfill the role of a cross-platform software portability solution of sorts.
The only remaining problems with that approach I had were around file permissions and ownership of the files written back to the out-of-sandbox filesystem (fixable but probably requires me to dig more into where such fix-ups would need to occur), but also the fact that runsc do doesn't have good control over which directories in its sandbox are writable back to the host and which aren't. Relying on gVisor's enforcement of file permissions would work but seems sub-optimal. #443 sounds like it partially solves this, but it would also be nice to just support this better in runsc itself.
Alternatively, manually creating an OCI container spec and starting it within runsc-in-Podman by using manual invocations of runsc create + runsc start would allow fine-grained control over mounted directories without needing runsc modifications. runsc do is basically just a helper command that does exactly that; it just doesn't have command-line flags to control the "volumes" part of that container spec.
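For illustration, that manual flow could look roughly like the sketch below. The bundle path, container ID, and config.json contents are illustrative assumptions, not the exact spec used here:

```sh
# Illustrative only: hand-write a minimal OCI bundle, then drive runsc through
# the standard OCI lifecycle commands instead of the "runsc do" helper.
mkdir -p /tmp/dz-bundle/rootfs
cat > /tmp/dz-bundle/config.json <<'EOF'
{
  "ociVersion": "1.0.0",
  "process": {
    "terminal": false,
    "user": { "uid": 1000, "gid": 1000 },
    "args": ["/usr/bin/python3", "-m", "dangerzone.conversion.doc_to_pixels"],
    "cwd": "/",
    "env": ["PYTHONPATH=/opt/dangerzone"]
  },
  "root": { "path": "rootfs", "readonly": true }
}
EOF
runsc --rootless --network=none create --bundle /tmp/dz-bundle dangerzone-sandbox
runsc --rootless --network=none start dangerzone-sandbox
runsc --rootless --network=none wait dangerzone-sandbox
runsc --rootless --network=none delete dangerzone-sandbox
```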
If that sounds like an interesting direction, then let me know and I can go that way rather than the "just make it work in Linux with Podman" approach.
Nice to see that your exploration has crossed paths with @apyrgio's.
The only remaining problems with that approach I had were around file permissions and ownership of the files written back to the out-of-sandbox filesystem (fixable but probably requires me to dig more into where such fix-ups would need to occur)
This won't be a problem once we fix #443 (which we probably need to do for the next release anyway, since we want to have the Qubes code shared as much as possible with the containers one).
Hey, sorry for the delay @EtiennePerot, life got a bit in the way :wink:
I actually had independently tried to implement it via the "in-container runsc do" mechanism you mention
Hey, that's great!
with such a setup, the role of Docker/Podman would no longer be that of a security boundary, and instead just fulfill the role of a cross-platform software portability solution of sorts.
I was thinking about this backwards: that the main sandbox would be the Docker/Podman runtime, and gVisor would supplement it with its kernel protection capabilities. The way you think about this makes sense though, having a very strong inner sandbox, and using the outer sandbox for cross-platform support.
Before going down the full privileged container route, I'd try to figure out which capabilities are necessary to make this work, so that we minimize the blast radius in case of a runsc escape. One would argue that the blast radius would be the same as in the case of a runc escape of course, but a little defense in depth here wouldn't hurt.
The only remaining problems with that approach I had were around file permissions and ownership of the files written back to the out-of-sandbox filesystem (fixable but probably requires me to dig more into where such fix-ups would need to occur),
Quick note regarding this: Since #443 is a blocker for oh-so-many-things, we've already started working on this and will soon have a prototype. I'd suggest you consider this as a non-issue, i.e., that we won't mount any directories into the container.
If that sounds like an interesting direction, then let me know and I can go that way rather than the "just make it work in Linux with Podman" approach.
It very much is :slightly_smiling_face: . If you're up for it, feel free to let us know. Else, we can try to tackle it once we're done with #443, and some Qubes fixes. In any case, thanks a lot for your persistence on this issue and sharing your gVisor knowledge.
Quick update here: we have some PRs under way (see #622, #627) that fix #443, at least for the first stage of the conversion. We can start experimenting with gVisor on top of them, but this task will be much clearer once the PRs are merged.
Another update: it seems that gVisor will soon have the ability to run within rootless Podman, which will simplify things a lot for Dangerzone. @EtiennePerot, sharing this in case you're still interested.
Yes, I've been following that :)
Still planning on getting around to this PR.
Uploaded a new commit that uses the approach of putting gVisor inside the Dangerzone container image and using it as a wrapper regardless of the container runtime. It works with both Podman and Docker, although I have only tested on Linux and I am not sure how file permissions with Docker volumes work on non-Linux platforms.
Also, the tests will fail because the latest gVisor release does not yet include https://github.com/google/gvisor/commit/88f7bb66f0dc4607d67281d6d61040f98375166b. I have tested it against a freshly-compiled gVisor binary and all tests pass on my end. I expect https://github.com/google/gvisor/commit/88f7bb66f0dc4607d67281d6d61040f98375166b will be part of next week's gVisor release.
Please read the long comment in dangerzone/gvisor_wrapper/entrypoint.py which explains the approach and why it works the way it does... It is difficult to get all the permissions lined up properly.
Let me know what you think!
Woah, that's exciting! We're currently in the midst of releasing Dangerzone 0.6.0 so I can't take a proper look right now, but I promise to do so as soon as possible.
One quick comment I have while skimming through the comments for the entrypoint is this: I understand that the /safezone mount is a factor that makes things more complicated. In practice though, it shouldn't affect the gVisor work. There are two reasons for that:
- In practice, we want to use gVisor for the sanitization that takes place on the first container. This container is the most sensitive one, since it's the only one [1] which is affected by RCEs. The good thing about this container is that we don't mount anything to it. It just reads the document from stdin and writes the pixels to stdout.
- The /safezone mount point will soon be removed, along with the second container. We are now very close to running the conversion of pixels to PDF on the user's host, which has several important benefits (smaller container image size, faster OCR, ability to update the container image from an image repository, etc.)
So, in a nutshell, I believe that we can simplify things further by targeting just the first container, and not worry about mounts. But I'll have more concrete things to comment on once we're done with 0.6.0.
[1] Well, there's always the scenario where the code that converts pixels to PDFs is subject to an RCE, but that's a much narrower attack surface, and we can't effectively defend against it. In Qubes, for instance, this part runs on the user's host.
Thanks for the details. I believe removing the need to have the /safezone mount would indeed simplify things a lot; specifically, it removes the need to have precise UID:GID alignment in the unsandboxed entrypoint, and it totally removes the need for the sandboxed entrypoint, as well as the three capabilities granted inside the gVisor sandbox.
Part of my thought process while building this was actually "can we just not have this mount at all and instead pass the files back and forth via something like a tarball over stdin/stdout?", but I guess simply not running the pixels-to-PDF part in a container solves that problem too.
Alright, I looked more carefully into the PR. I have several questions, some of those are just basic gVisor questions, and some apply to Dangerzone specifically. Here goes:
- Assuming that we use runsc only on the first container, what is the practical difference between using runsc do and runsc run?

  Context: The reason I'm asking is that the following invocation:

      runsc --rootless --network=none do env PYTHONPATH=/opt/dangerzone \
          python3 -m dangerzone.conversion.doc_to_pixels

  works when I test things locally, and in the spirit of keeping things as simple as possible, it's very appealing. However, I do read that this option is for testing only, so I'm wondering if there's a footgun that I'm missing.
(I've seen your OCI config btw, and I already spot some differences, but I'm mainly asking if there's anything foundational that I'm missing)
- Before including gVisor support, we should clarify what security guarantees the outer container (Podman/Docker) provides, and what security guarantees the inner container (gVisor) provides. From the code and the documentation on the entrypoint script, I understand that the outer layer:
  - No longer drops capabilities. This is the responsibility of the inner layer.
  - No longer sets a seccomp filter. This is done by the inner layer, and in a stricter fashion as well.
Anything else that's changed?
- We have a pending issue for supporting user namespaces (#228). This is something that Docker does not support, but gVisor does, so that's great! My main issue with user namespaces is that UID 0 typically maps to the owner of the namespace on the host. In Linux (Podman) that's the user who starts the Dangerzone application. In Windows / macOS, it's the root user that runs in the WSL/HyperKit VM. What's the case in gVisor?

  (My plan was to use something like userns=nomap for supported Podman versions. If gVisor can support something similar across OSes, that would be amazing.)
@EtiennePerot kind ping on the above questions, so that we don't lose context.
Hey, sorry I missed the set of questions.
Assuming that we use runsc only on the first container, what is the practical difference between using runsc do and runsc run?
runsc do is just a convenience helper. It is true that under the hood it's basically just doing runsc create + runsc run, but there are many knobs that runsc do doesn't expose. For example, runsc do exposes the whole host filesystem by default, which is unnecessarily wide for Dangerzone. There are also some subtleties around how runsc do operates to act as a convenience wrapper; for example, in root-ful mode with networking enabled, it'll do a whole dance with network namespace setup. It's meant to be for convenience, not so much for the tightest possible security.
But beyond the OCI differences (which I think are worth digging into), the main reason is simply supportability. runsc create and runsc run are specified as part of the OCI runtime interface spec, whereas runsc do is just a gVisor-specific helper. It's therefore a less stable API, and there is no guarantee that it keeps working the way it does into the future (e.g. no guarantee that its effective spec will stay the same and not make further security compromises). I think it's worth avoiding runsc do for that reason alone.
(Well, technically the OCI spec only specifies runsc start rather than runsc run, but the interface runsc run takes is meant to be the same as runsc start so I think the argument still applies. Though I could edit the script to use runsc start if you think it's better.)
In terms of practical OCI spec differences:
- runsc do enables TTY emulation by default. This is a notoriously complicated part of the POSIX API, and Dangerzone doesn't need it, so better to have it disabled.
- runsc do would mount the whole host filesystem as / by default (as per above), and does so non-readonly. (In practice, the host filesystem is still effectively read-only, because gVisor implements an overlay on top of the root filesystem. But specifying the OCI spec also allows explicitly mounting the root as read-only inside the sandbox.) This also means that the in-sandbox workload can see files such as the out-of-sandbox entrypoint (entrypoint.py) and the runsc binary itself, i.e. it can trivially learn more information about its own environment than it strictly needs to know. The manual spec only mounts the specific parts of the filesystem that Dangerzone actually needs.
- runsc do doesn't allow setting mount options, whereas the manual spec can specify mount options like nosuid, noexec, nodev to further restrict what the workload can do with files exposed from /safezone (see the sketch after this list).
- runsc do doesn't make guarantees about which host namespaces it isolates itself in, whereas the manual OCI spec causes runsc to isolate itself in a separate host PID+Network+IPC+UTS+mount namespace.
- runsc do doesn't allow specifying an in-sandbox seccomp filter. We could add one for more defense-in-depth; see below.
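To make the mount-options point concrete, here is a rough sketch of what the mounts section of such a hand-written config.json could contain. The paths and option lists are illustrative, not the exact spec from this PR:

```sh
# Illustrative only: a "mounts" fragment for the hand-written OCI config.json.
# The nosuid/noexec/nodev options restrict what the sandboxed workload can do
# with files exposed from the shared directory.
cat > /tmp/dz-mounts-fragment.json <<'EOF'
{
  "mounts": [
    { "destination": "/proc", "type": "proc", "source": "proc" },
    {
      "destination": "/safezone",
      "type": "bind",
      "source": "/safezone",
      "options": ["rbind", "nosuid", "noexec", "nodev"]
    }
  ]
}
EOF
```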
Before including gVisor support, we should clarify what security guarantees the outer container (Podman/Docker) provides, and what security guarantees the inner container (gVisor) provides. [...] Anything else that's changed?
I think the main thing is, well, that the inner layer now has gVisor in the middle. I don't want to sound too salesman-y, but gVisor emulates a full-fledged POSIX kernel implementation. It's not just about one security measure being moved from one layer to the other; for some of them, it's effectively doubling them up: there are now two distinct implementations of the same security measure that need to be exploited simultaneously in order to break through.
For example, if the sandboxed workload is running as a non-root user inside the gVisor sandbox, and the gVisor sandbox is itself running as an unprivileged user on the host, then for the sandboxed workload to escalate to root on the host, it would need a user escalation exploit for both kernels (and, since those two kernels don't share code, the same exploit generally won't work against both simultaneously). Same thing for filesystem-level isolation: that security measure exists at both layers (technically there are three levels of filesystem-level isolation: gVisor's implementation, then the fact that the gVisor process places itself into a chroot + Linux mount namespace that exposes only the minimum possible from the host filesystem, and then Docker/Podman's own filesystem isolation). Similar situation for PID namespaces: there's gVisor's own in-sandbox process tree, there's the fact that the gVisor process isolates itself in its own Linux PID namespace (as specified in the OCI spec), and there's the fact that the Docker/Podman container itself is running in a dedicated Linux PID namespace. Same kind of thing for the other types of namespaces.
More generally speaking, it takes at least two unique kernel vulnerabilities (one specifically against the gVisor kernel, plus one specifically against the Linux kernel) in order to fully break out onto the host system. On Windows/OSX, it takes one gVisor kernel exploit plus a VM escape exploit to fully break out.
There are some layers that aren't "doubled up" in this manner, like the seccomp one you point out, although actually it would be possible to add a seccomp filter enforced within the gVisor sandbox (using gVisor's seccomp-bpf implementation) on top of what's already here. I can add that to this PR if you wish. But if we're going to do that, I actually think there's potential to go beyond applying something like Docker's default seccomp filter, and come up with a fine-tuned seccomp filter that specifically allows strictly what the PDF-to-pixel program needs.
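As a rough illustration of what that could look like, the OCI spec accepts a seccomp section that gVisor would then enforce inside the sandbox. The allowlist below is a deliberately tiny toy example, not a vetted policy for the doc-to-pixels workload:

```sh
# Illustrative only: an OCI "linux.seccomp" fragment enforced inside the
# sandbox. A real policy would have to cover every syscall the conversion
# process legitimately needs.
cat > /tmp/dz-seccomp-fragment.json <<'EOF'
{
  "linux": {
    "seccomp": {
      "defaultAction": "SCMP_ACT_ERRNO",
      "syscalls": [
        {
          "names": ["read", "write", "openat", "close", "mmap", "exit_group"],
          "action": "SCMP_ACT_ALLOW"
        }
      ]
    }
  }
}
EOF
```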
In terms of thinking of security responsibilities (i.e. what each layer "ought to" provide), at a high-level, I think the framing in an earlier comment is pretty well formulated: the outer container's main responsibility is to act as a "platform compatibility" solution, whereas the inner container's responsibility is solely security. The outer layer's own security measures (e.g. non-privileged user, filesystem isolation, PID namespace, etc.) can be seen as just an added security bonus, so long as they don't interfere with the inner layer's ability to work properly (hence, as an example, the need to remove the Docker-level seccomp filter).
We have a pending issue for supporting user namespaces [...] In Linux (Podman) that's the user who starts the Dangerzone application. In Windows / macOS, it's the root user that runs in the WSL/HyperKit VM. What's the case in gVisor?
The user namespace handling is probably the most complicated part of the current implementation, because of the need to preserve UIDs on Linux so that files in the /safezone volumes are mapped to the user's UID on the host. As long as that remains necessary, then running as a user that ultimately maps to that UID on the host is unavoidable.
However, if, as per the above discussion, we can get rid of this, then all of the current user namespace stuff can be simplified and further locked down to have no relationship with any existing user on the host system. This means we could create a user that only exists inside the outer container, and run the gVisor process as that user in a user namespace that exposes no other user. (On the initial host user namespace, I believe it would appear as an unnamed user with a UID that isn't in /etc/passwd.)
Then, on top of that, since gVisor is its own kernel and thus implements its own notion of users and user namespaces, the workload within the sandbox can itself run in an in-sandbox user namespace that doesn't have a mapping to the host user (i.e. to the user that the gVisor process runs as). (The current implementation of this PR kind of does this already. It has two in-sandbox users: the python3 -m dangerzone.conversion.doc_to_pixels command runs as the "UID 1000" user which has no mapping to any user outside the sandbox, and a "UID 0" that maps to the host UID that owns /safezone so that file permissions can still line up properly.)
Thanks Etienne for answering all my questions in great detail. Not only am I covered, but I think we have enough material to update the parent issue, and write down a design document. I plan to follow up on the above on Monday, and maybe offer some next steps. My guess is that our lives will be much easier once we've tackled #625, so I'll make sure to prioritize it next week.
Sounds good. One small question: which issue do you mean by "updating the parent issue"?
I agree that addressing #625 first makes sense, otherwise this PR would add temporary complexity that doesn't need to ever exist if #625 is addressed first. If you wish, I can already start simplifying this PR to what it would look like if it only needs to support PDF-to-pixels conversion.
Sounds good. One small question: which issue do you mean by "updating the parent issue"?
I was referring to this issue: https://github.com/freedomofpress/dangerzone/issues/126. It doesn't have the context that this discussion has, so I'd like to move some there, for future reference.
If you wish, I can already start simplifying this PR to what it would look like if it only needs to support PDF-to-pixels conversion.
Sure, if it's not too much of a hassle for you. I don't expect we'll have many more architectural changes in the near future, so you should be good to go. The only relevant thing I can think of is that we'll experiment with switching to a Debian image soon, but I think this should not affect this discussion.
Quick update here. I actually prioritized implementing the on-host pixels to PDF conversion PR (https://github.com/freedomofpress/dangerzone/pull/748), which is a prerequisite for vastly simplifying this one. Now that it's out, I'll follow up here soon.
I've updated this PR to remove the support for the /safezone volume. As promised, it simplifies entrypoint.py a whole lot.
Also, thanks for pointing out Podman's --userns=nomap option. Turns out that already does what I was otherwise planning to do manually from within entrypoint.py to run as a user that's as unprivileged as possible, so that simplifies things further. I've added this flag to container.py.
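For readers without the code at hand, the resulting invocation looks roughly like the sketch below; the image name and entrypoint path are illustrative, and --userns=nomap requires a sufficiently recent Podman:

```sh
# Illustrative only: run the container in a user namespace that maps no host
# user at all, so the gVisor process inside has no relationship to any real
# user on the host.
podman run --rm -i --userns=nomap \
    dangerzone.rocks/dangerzone \
    /entrypoint.py
```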
The latest branch looks almost ready for inclusion. I want to do a last pass, document our architectural choices, and run tests on every platform. The latter are currently failing, so I'll try to debug a bit there.
While experimenting with this PR, I realized that we can re-introduce --cap-drop all, if we add the following capabilities: SETFCAP and SYS_CHROOT. It's not much for the security of the outer layer (Podman/Docker), but it's something.
Also, my understanding is that the sandboxed UID/GID that we search for with chroot /dangerzone-image/rootfs id -u dangerzone will always be 1000, so we can remove these commands. Note that the SYS_CHROOT capability will still be required by runsc, so we can't remove it from the aforementioned list.
Heads up, I have a design document ready, that should explain how the gVisor integration works in Dangerzone, to people who have not seen the code: https://github.com/freedomofpress/dangerzone/pull/815/commits/8641b66b0db634d1b6b849f9047a93671d7c5a13
@EtiennePerot if you feel I have missed something important, or that I've explained something incorrectly, feel free to chime in. In the meantime, I'll take a look at the CI build failures.
While experimenting with this PR, I realized that we can re-introduce --cap-drop all, if we add the following capabilities: SETFCAP and SYS_CHROOT. It's not much for the security of the outer layer (Podman/Docker), but it's something.
I'm not at my development machine now but will do when I get the chance.
Also, my understanding is that the sandboxed UID/GID that we search for with chroot /dangerzone-image/rootfs id -u dangerzone will always be 1000, so we can remove these commands. Note that the SYS_CHROOT capability will still be required by runsc, so we can't remove it from the aforementioned list.
I'm not sure that's actually always the case; it depends on which UID/GID the adduser/addgroup commands use during the image building process. Now that they no longer have an explicit UID/GID passed to them, I'm not sure we can assume that they are stable across build environments.
It's true that they are stable per image build, so it would be possible to write the dangerzone UID/GID to a file in the outer container filesystem during the image build process, and read that file from the Python code. But I figure that the "spawn a chrooted process to find out" solution is essentially the same thing: it's reading the UID/GID values from a file in the container filesystem just as well. :)
Another solution might be to have some placeholder lines in entrypoint.py like
DANGERZONE_UID = %DANGERZONE_UID%
DANGERZONE_GID = %DANGERZONE_GID%
... and then to call sed s/%DANGERZONE_UID%/${DANGERZONE_UID}/g (and same for GID) on entrypoint.py during the image build process. Then it would appear "hardcoded" inside the Python script, but still not depend on how Podman/Docker/adduser/addgroup decide to allocate user/group ID ranges.
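Concretely, the image build step could look something like the following sketch (the entrypoint path and variable wiring are illustrative):

```sh
# Illustrative only: at image build time, substitute the placeholders with
# whatever UID/GID adduser/addgroup happened to assign to the dangerzone user.
DANGERZONE_UID="$(id -u dangerzone)"
DANGERZONE_GID="$(id -g dangerzone)"
sed -i \
    -e "s/%DANGERZONE_UID%/${DANGERZONE_UID}/g" \
    -e "s/%DANGERZONE_GID%/${DANGERZONE_GID}/g" \
    /entrypoint.py
```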
Heads up, I have a design document ready, that should explain how the gVisor integration works in Dangerzone, to people who have not seen the code: 8641b66
@EtiennePerot if you feel I have missed something important, or that I've explained something incorrectly, feel free to chime in. In the meantime, I'll take a look at the CI build failures.
Thanks, this is great! Left some comments on the commit.
I'm not sure that's actually always the case; it depends on which UID/GID the adduser/addgroup commands use during the image building process. Now that they no longer have an explicit UID/GID passed to them, I'm not sure we can assume that they are stable across build environments.
Hm, I'm ultimately ok with that. Those commands are stable enough, and I expect them to assign UID 1000 to the created user consistently. It may be the case though that Alpine Linux decides to add a built-in user in the container image (looking at you, Ubuntu :roll_eyes: ), and the UID will then become 1001. Even then, our nightly checks will fail loudly and we can fix it.
On the upside, we'll be closer to removing one more capability from the Docker/Podman invocation, and slightly improve readability. So, I'm more inclined to go that way, if you don't have a strong objection.
I've made some fixes to another branch of mine (see wip-gvisor-2), to make the code work for the second phase as well, and make it suitable for our various Linux platforms. I think the Debian Bullseye tests are still failing though....
In any case, tell me what you think. I'm especially curious about the handling of the second phase, since I'm doing some radical stuff there :grimacing:
I think I've made the PR work in every platform (~haven't tested on Windows yet~ edit: just did and it works :partying_face: ). I've made some extra fixes in my wip-gvisor-2 branch:
- Make PyMuPDF within the container get the correct tesseract data path. It's hacky and I don't like it at all, but thankfully all these will soon be removed.
- Ubuntu Focal / Debian Bullseye were failing, because the Podman version that they ship does not contain the ptrace() syscall in the default seccomp policy. I've added a seccomp.json policy file (taken from the official repo) that includes this syscall. This policy will be used just for these distros (see the sketch after this list).
- I tested this PR on macOS (edit: and Windows), which uses Docker, and to my surprise, I saw that using an unconfined seccomp policy is not necessary.
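For reference, pointing the outer runtime at such a custom policy looks roughly like this (the policy path is illustrative):

```sh
# Illustrative only: use a custom seccomp policy (one that permits ptrace())
# instead of the distro's default, only on the affected distros.
podman run --rm -i \
    --security-opt seccomp=/usr/share/dangerzone/seccomp.gvisor.json \
    dangerzone.rocks/dangerzone \
    /entrypoint.py
```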
Hi @EtiennePerot :wave:. From my side, I think we're in a position where we can finally merge the gVisor PR (see wip-gvisor-2 branch) and the corresponding design document, with some polishes here and there.
Before doing so, I would greatly appreciate your opinion on the changes I've mentioned above. We've already discussed the change regarding UID 1000, but I've made some extra changes since then, while testing Dangerzone across all of our supported platforms. We plan to merge these PRs starting next week, but if you want more time to look into them, let me know.
Hi, thanks for the review and all the work on the follow-up changes. Will review this weekend
Oh, one more thing for when you have time. I was under the impression that gVisor was using a more modern alternative to ptrace() called systrap. However, I'm not sure if it's used within the container, or if it silently falls back to ptrace(). I'm questioning this because some of the issues were resolved once I allowed the ptrace() syscall on selected platforms.
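If it helps the investigation, one option might be to pin the platform explicitly in the runsc invocation rather than relying on the default, assuming the bundled runsc version supports systrap (the flag placement and bundle path here are illustrative):

```sh
# Illustrative only: request the systrap platform explicitly, so that whichever
# platform is actually in use is no longer left to the default.
runsc --rootless --network=none --platform=systrap run \
    --bundle /tmp/dz-bundle dangerzone-sandbox
```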