
OCI: Support `overlay` filesystem type

Open fvoznika opened this issue 3 years ago • 9 comments

Reported by Ron Braunstein:

I have an overlay mount in config.json that works for "runc", but not for "runsc"

$ grep -B2 -A7 overlay config.json
	{
		"destination": "/ov",
		"type": "overlay",
		"options": [
			"lowerdir=/tmp/lower",
			"upperdir=/tmp/upper",
			"workdir=/tmp/work"
		]
	}

I'm expecting to have an "/ov" mount in the container that uses the host /tmp/lower and /tmp/upper directories.

=== Run the container with runsc ===

(base) ron@gamer:~/runsc/bundle$ sudo runsc run demo
bash: /root/.bashrc: Permission denied
root@runsc:/# mount
none on / type overlayfs (rw)
none on /dev/pts type devpts (rw)
none on /proc type proc (rw)
none on /dev type overlayfs (rw)
none on /sys type sysfs (ro,noexec)
none on /tmp type tmpfs (rw)

root@runsc:/# df -H /ov
Filesystem      Size  Used Avail Use% Mounted on
-               206G  198G  7.5G  97% /ov

root@runsc:/# touch /ov/hi
touch: cannot touch '/ov/hi': Permission denied

root@runsc:/# exit
exit

=== On the host ===

(base) ron@gamer:~/runsc/bundle$ ls -ltr /tmp/lower /tmp/upper /tmp/work
/tmp/upper:
total 0

/tmp/lower:
total 4
-rw-rw-r-- 1 ron ron 3 Nov  5 20:46 file2

/tmp/work:
total 4
d--------- 2 root root 4096 Nov  5 21:14 work

fvoznika avatar Nov 09 '20 20:11 fvoznika

Overlay mount type is not yet supported by runsc and gets ignored. I don't see any reason why it couldn't be supported. As a workaround, runsc has an option to add an overlay on top of all mounts in the container with the --overlay flag (see here for how to set flags).

fvoznika avatar Nov 09 '20 20:11 fvoznika

Thanks. I had trouble getting --overlay to do what I wanted as well; i.e., I'm not sure where the files get written to. The docs talk about using it with Docker, but I'm running runsc directly.

I created a new container:

   % cd repro/
   % mkdir rootfs
   % docker export $(docker create alpine) | tar -xf - -C rootfs
   % runsc spec

updated the spec to add:

  {
    "destination": "/mnt",
    "source": "/tmp/onhost",
    "type": "bind",
    "options": ["rw"]
  }

and created /tmp/onhost/file1 on the host and made it writable.

Then I tried writing to a file in the overlay:

  % sudo runsc --overlay run alpinec1
       ls -l /mnt/file1
       -rw-rw-rw-    1 18612    9000             0 Nov  9 22:18 /mnt/file1
       echo "testing" >> /mnt/file1
       ls -l /mnt/file1
       -rw-rw-rw-    1 18612    9000             8 Nov  9 22:23 /mnt/file1
  <exit container>

The mounted file isn't written to, so I'm not sure where the overlay writes.

-rw-rw-rw- 1 ron stripe 0 Nov  9 22:18 /tmp/onhost/file1

If I rerun without --overlay, then /tmp/onhost/file1 gets updated as expected.

My real goal is to use the overlay feature to get a union of two mounted filesystems while persisting the writes to the upper one.
I'm trying to test how --overlay will help with that. Thanks for any suggestions.

rbraunstein avatar Nov 09 '20 22:11 rbraunstein

The --overlay flag creates an overlay with the upper layer using tmpfs (which is internal to the sandbox). The intent here is to keep all file system modifications inside the sandbox. It's not suitable for what you're trying to do. For now, you could mount the overlay outside the sandbox and bind mount the merged layer inside the sandbox:

mount -t overlay overlay -o lowerdir=/tmp/lower,upperdir=/tmp/upper,workdir=/tmp/work /tmp/ov

Then add the following to the spec:

  {
    "destination": "/ov",
    "source": "/tmp/ov",
    "type": "bind",
    "options": ["rw"]
  }

fvoznika avatar Nov 10 '20 01:11 fvoznika

Hi folks. This is a feature I'm very much interested in. :) The ability to specify overlay mounts with chosen paths as layers from the host system is an incredible superpower for assembling filesystems efficiently.

To recap on goalposts:

  • The options listed in the initial post -- especially, "lowerdir=/tmp/lower" -- are a nice clear statement of the goalpost! Overlay mounts are a pretty well known feature and something users of container systems often expect to be supported. I think it aligns well with the goals and scope of runsc to support an "overlay" mount type, with options such as "lowerdir", in a way that matches what e.g. runc will do via the overlayfs feature of the host kernel.
  • Ability to specify overlay mounts in the OCI mount spec section is distinct from turning on a global overlay system. User specified paths are an important part of what this feature should support. (And the closer this is to 1:1 drop-in parity with runc or other container systems using the host kernel's implementation of an overlay mount system, the better!)
  • Overlay mounts should also (ideally) support multiple lowerdirs, a la "lowerdir=/a:/b", as discussed in https://docs.kernel.org/filesystems/overlayfs.html#multiple-lower-layers . (A concrete example of such a mount entry follows this list.)
  • And personally, I'm also trying to get this working while in --rootless mode! (I don't think this has noticeably raised the challenge level, fortunately.)
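
To make that recap concrete, here's the kind of mount entry I mean, written with the runtime-spec Go types rather than raw JSON (a hedged sketch; the paths are hypothetical, and the colon-separated lowerdir shows the multi-lower-layer form):

    package main

    import (
        "fmt"

        specs "github.com/opencontainers/runtime-spec/specs-go"
    )

    func main() {
        // Hypothetical overlay mount entry, mirroring the config.json fragment
        // from the original report, plus a colon-separated multi-lowerdir.
        m := specs.Mount{
            Destination: "/ov",
            Type:        "overlay",
            Source:      "overlay", // typically "none" or "overlay" for this mount type
            Options: []string{
                "lowerdir=/tmp/lower-a:/tmp/lower-b", // multiple lower layers
                "upperdir=/tmp/upper",
                "workdir=/tmp/work",
            },
        }
        fmt.Printf("%+v\n", m)
    }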

I've taken a bit of a peek at how close this already is to being possible. I think the answer might be "close!" :D

Here are my notes, and some partial progress I've experimented with so far:

  • The gvisor.dev/gvisor/pkg/sentry/fsimpl/overlay package looks like it already contains a very large amount of the work! Amazing! :star_struck: It looks like largely a question of how to get that wired up such that the OCI mount spec is allowed to use it!
  • The first intervention I see needed is in runsc/boot/vfs.go in func getMountNameAndOptions: that function contains a switch statement that's an allowlist for mount types the OCI spec can provide. We essentially just need to add a case to it for the string "overlay"! (And spec'ing some SupportedMountOptions in the overlay package, and wiring it up with some consumeMountOptions stuff, like the other cases have; small potatoes. I sketched this out already and it seems to work -- a rough sketch is at the end of this comment.)
  • That's very nearly it. (?! already? yes, I think so!)
    • Once we pass the options through... the "lowerdir" option is already understood by the overlay package. Other options like "workdir" can get passed through as well; and where they aren't relevant to gvisor's overlay system, it already knowingly ignores them. (Seems fine to me!)
  • Except...

There are two more problems (that I'm aware of):

  • 1st (but less importantly), some part of the mount processing code is eagerly transforming the "source" part of the OCI mount spec into an absolute path. That's not the right thing to do for overlay mounts -- the source string for those is typically "none" or "overlay" (and the closest thing to a real "source" path is the lowerdirs, in the options). This is a cosmetic mistake only, as far as I can tell, but it looks funny in the logs.
  • 2nd, and where I'm currently stuck: when the runsc/boot/vfs.go code is putting together all the mounts specified by the OCI mounts spec... things go south. I have an error in the logs roughly like:

overlay.go:215] overlay.FilesystemType.GetFilesystem: failed to resolve upperdir "/my/host/specific/full/path/upper": no such file or directory

The path that appears there is a correct and real path on my host system. So what's going on?

I'm pretty sure this is because at that point, the runsc process attempting to put these mounts together is already partially sandboxed... and so we need some logic in the gofer process, so that it makes those paths for upperdir and lowerdir available to the sandbox depth we're in already. Currently, that's not happening, because it would require the gofer to understand that paths within the options strings of overlay mount specs deserve to be exposed to the sandbox -- and (understandably) I don't think it does that yet.

That does seem like a solvable problem! I haven't found the exact point in the code to make that intervention yet, though. (Pointers welcome, if someone takes a peek at this and is already familiar with the system!)

That's what I know today! I'm happy to poke at hacking at this more. I may cycle back on my own as time allows, but some pointers would be super duper welcome to speed things along.
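
For reference, here's roughly the shape of the allowlist change I sketched, as a standalone, hedged Go example rather than the actual runsc code (the real getMountNameAndOptions has a different signature and uses its consumeMountOptions-style helpers; this only illustrates the idea of letting "overlay" through and passing its options along):

    package main

    import (
        "fmt"
        "strings"
    )

    // mountNameAndOptions mimics the allowlist switch in
    // runsc/boot/vfs.go:getMountNameAndOptions, with an extra case for
    // "overlay" that keeps only the options the sentry overlay driver cares
    // about. This is a sketch, not the real signature or behavior.
    func mountNameAndOptions(mountType string, options []string) (string, []string, error) {
        switch mountType {
        case "bind":
            return "gofer", nil, nil // bind mounts are served by the gofer
        case "tmpfs":
            return "tmpfs", options, nil
        case "overlay":
            var kept []string
            for _, o := range options {
                if strings.HasPrefix(o, "lowerdir=") ||
                    strings.HasPrefix(o, "upperdir=") ||
                    strings.HasPrefix(o, "workdir=") {
                    kept = append(kept, o)
                }
            }
            return "overlay", kept, nil
        default:
            return "", nil, fmt.Errorf("unsupported mount type %q", mountType)
        }
    }

    func main() {
        name, opts, err := mountNameAndOptions("overlay", []string{
            "lowerdir=/tmp/lower", "upperdir=/tmp/upper", "workdir=/tmp/work",
        })
        fmt.Println(name, opts, err)
    }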

warpfork avatar Dec 06 '23 00:12 warpfork

Hi @warpfork!

Thanks for looking into this. You are looking in all the right places. To reiterate, gvisor.dev/gvisor/pkg/sentry/fsimpl/overlay is our overlay filesystem driver (similar to fs/overlayfs in Linux). The code you are looking at in runsc/boot/vfs.go is the filesystem setup code, which is run once the sandbox process has been set up (seccomp filters have been installed, capabilities have been applied, etc.) and the gofer process has started. It sets up the container's filesystem tree as described in the spec and initializes all mounts (which may in turn connect with their respective gofer endpoints).

There are 2 major blockers I see, beyond what you have already implemented (as described above):

  1. Gofer process needs to expose lowerdir/upperdir for overlay mounts by bind mounting them somewhere inside the gofer chroot.

I'm pretty sure this is because at that point, the runsc process attempting to put these mounts together is already partially sandboxed... and so we need some logic in the gofer process, so that it makes those paths for upperdir and lowerdir available to the sandbox depth we're in already. Currently, that's not happening, because it would require the gofer to understand that paths within the options strings of overlay mount specs deserve to be exposed to the sandbox -- and (understandably) I don't think it does that yet.

This is correct, you pretty much figured it out. You need to look at https://github.com/google/gvisor/blob/f4b851067a3ad746990a73d9d772717e32394a25/runsc/cmd/gofer.go#L361

It creates bind mounts for all type=bind mounts in the spec (see g.setupMounts()). Note that mounts are placed at their respective destinations, so the resultant gofer chroot looks like the container filesystem tree. The upperdir/lowerdir do not have a "destination" inside the container filesystem tree, so it is an open question where we will place them (see the sketch at the end of this comment).

  2. All host filesystems are configured as "gofer mounts" in the gVisor sandbox. If upperdir is a host path, then it needs to be backed by a gofer mount. However, using a gofer mount as an upper layer is currently unsupported; only type=tmpfs mounts are currently supported as the upper layer in overlayfs. We need to add support for the following features in the gofer filesystem driver (gvisor.dev/gvisor/pkg/sentry/fsimpl/gofer), as these are required by the overlay driver:
    1. Whiteout device creation. The overlayfs driver creates a whiteout device (a character device with 0/0 device number) in the upper layer to represent deleted files. fsimpl/gofer currently does not allow device creation. Adding support should be simple though. @avagin pointed out that whiteout device creation is exempted from the CAP_MKNOD check (see fs/namei.c:vfs_mknod()) and the device cgroup check (see include/linux/device_cgroup.h:devcgroup_inode_mknod()), so we don't need to worry about configuring device cgroups or capabilities. We can selectively allow only the mknodat(2) syscall with dev=0 in our seccomp filters.
    2. Xattr support. The overlayfs driver wants to set the trusted.overlay.opaque attribute on opaque directories. But fsgofer/directfs currently does not support any xattr operations (probably for security reasons: to avoid having to allow these syscalls in the seccomp filter). We will need to add support (however, this will require loosening the seccomp filters).
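
To make both points a bit more concrete, here is a hedged sketch of the two host-side operations involved. The "/layers" prefix inside the gofer chroot is made up for illustration; where these directories would actually be placed is exactly the open question above.

    package goferoverlay

    import (
        "fmt"
        "os"
        "path/filepath"

        "golang.org/x/sys/unix"
    )

    // bindLayerIntoChroot makes a host layer directory (a lowerdir, upperdir,
    // or workdir) visible inside the gofer chroot via a recursive bind mount.
    func bindLayerIntoChroot(chroot, hostPath string) (string, error) {
        dst := filepath.Join(chroot, "layers", hostPath)
        if err := os.MkdirAll(dst, 0o755); err != nil {
            return "", err
        }
        if err := unix.Mount(hostPath, dst, "", unix.MS_BIND|unix.MS_REC, ""); err != nil {
            return "", fmt.Errorf("bind mounting %q: %w", hostPath, err)
        }
        return dst, nil
    }

    // createWhiteout creates an overlayfs-style whiteout: a character device
    // with device number 0/0, which is how deleted lower-layer files are
    // represented in the upper layer. As noted above, Linux exempts this
    // particular mknod from the CAP_MKNOD and device cgroup checks.
    func createWhiteout(upperPath string) error {
        return unix.Mknod(upperPath, unix.S_IFCHR|0o000, int(unix.Mkdev(0, 0)))
    }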

ayushr2 avatar Dec 08 '23 19:12 ayushr2

Actually there may be a much simpler way to support this. Since both upper and lower layers are host directories, we can resolve the overlay-ness in the gofer itself and serve this as a normal gofer mount. In the gofer (in g.setupMounts()), we can create an overlay mount (using the host overlay driver) at the appropriate destination inside the container filesystem. Then we just need to update runsc to consider type=overlay as a normal gofer mount and serve it using a gofer connection.

This way we bypass the above mentioned issues/feature gaps. Since both layers are host filesystems, there is no benefit in overlaying them inside the sentry using fsimpl/overlay. The sentry overlay driver is used to stitch sentry-internal layers (like tmpfs) with gofer layers. Instead, we rely on the host to overlay them.
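
A hedged sketch of what that gofer-side step could look like (the helper below is illustrative, not existing runsc code; it is simply the Go equivalent of the mount(8) invocation shown earlier in this thread):

    package goferhostoverlay

    import (
        "fmt"
        "strings"

        "golang.org/x/sys/unix"
    )

    // mountHostOverlay asks the host kernel to assemble the overlay at dst
    // (the mount's destination inside the gofer's view of the container
    // filesystem tree), equivalent to:
    //   mount -t overlay overlay -o lowerdir=...,upperdir=...,workdir=... dst
    // The sentry would then serve dst as an ordinary gofer-backed mount.
    func mountHostOverlay(dst string, lowerdirs []string, upperdir, workdir string) error {
        data := fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s",
            strings.Join(lowerdirs, ":"), upperdir, workdir)
        if err := unix.Mount("overlay", dst, "overlay", 0, data); err != nil {
            return fmt.Errorf("overlay mount at %q: %w", dst, err)
        }
        return nil
    }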

ayushr2 avatar Dec 09 '23 09:12 ayushr2

Hey, thanks so much for your replies! My own reply took a moment to marshal, because those are both valid design plans, but they also have some really interesting potential for distinct outcomes.

Labelling to help make the discussion easier:

  • Approach A: gofer understands overlay options, gets all necessary component filesystems mounted (somewhere not in the path destined to become the rootfs), and the overlay itself is performed using gvisor's code.
  • Approach B: the gofer process issues an overlay mount syscall once it's established in enough namespaces to do so. The host kernel's overlayfs implementation is used.

First of all: yes, I agree that both approaches should work.

Approach A has gvisor shouldering slightly more development work. (But, a remarkable amount of which has already been done!)

Approach A also has the potential advantage of greater portability. Approach B, relying on the host's overlayfs, makes runsc as a whole somewhat more subject to any potential quirks of the host kernel's overlayfs feature.

In performance, I guess the host kernel's overlay code might have a slight advantage, because presumably it can read the lower filesystems without doing extra userspace/kernelspace switching. I don't know if that would be significant, nor if it's an important trade vs things like portability.


So about that portability. That maybe sounds like a curveball, and "surely that's not a significant concern", but I have to confess, it's actually kinda why I'm looking closely at gvisor. The portability of overlayfs, especially with rootless containers, has been incredibly... Let's call it "sus", in my experience. I'd actually love it if gvisor could present a more consistent experience than the host kernel itself. I'd even accept fairly significant deviations of semantics in order to get that, personally.

Some examples:

  • overlayfs sometimes needs an option of "xuserattrs" in order to work correctly in rootless containers. On other host kernel versions, the overlay works fine, even in a rootless context, without that attribute... but providing that attribute is invalid and causes your mount to fail. (I have an example of each, on my desk right now, I regret to report.)
  • overlayfs fails to work (in very poorly reported ways!) on some combinations of host filesystems. For example, using a tmpfs in (some of!) the layers fails. So does using ZFS. And this is not an exclusive list. Many footguns here.
  • Sometimes when I say "poorly reported" failures, I just mean "you have to grep dmesg or other corners to find out why it failed", but at least the mount did fail. I consider that poor because it's hard to automatically re-associate those failure details with the action, so it's hard to report them clearly to the user automatically. But sometimes it's much worse...
  • Sometimes overlayfs creates a mount, and it largely appears to work... Until you try to remove a file. Then it produces an I/O error inside the container, to those user processes. This is catastrophically confusing and hard to wrap a good experience around.

So, again, to summarize, overlayfs as provided by the host can actually be quite a bumpy zone. (I almost miss the days of AUFS.)

I think a pure gvisor implementation of overlay could be a powerful way to offer a stable and consistent behavior that's actually an improvement on what the host can provide.

On the other hand, in Approach B, using the host's overlayfs, even there it might behoove gvisor to be aware of these potential issues. It would be valuable to teach the gofer process how to detect and flag these issues in a clear way, so that the user experience can be good even when things aren't in a fully functional setup and we have to request intervention.

(I should probably expand a bit on that mention of potentially acceptable deviations of semantics with examples, too. I'm thinking about implementation details like the exact usages of xattrs and device nodes with certain numbers. Those are some very interesting choices that the kernel's overlayfs implementation made... and I don't think most users care. At the same time, I suspect a lot of these odd edge cases, quirks, and incredibly-unfortunate-incomposabilities in the kernel's overlayfs come down to some of those very same choices in implementation detail. So, making different choices in a gvisor overlay system that's semantic, rather than quirk-for-quirk identical -- I strongly suspect would probably be acceptable to users. And it might even simplify the implementation roadmap.)

(Another thing the kernel's overlayfs spends a great deal of attention on, where I think many users could also care less, are the st_ino, st_dev, and d_ino numbers in stat and readdir. Apparently it's the source of a lot of kerfuffle in the kernel's overlayfs -- there's a whole table about its conditional functioning. I think another implementation of a semantic overlayfs should feel entirely free to disregard all of this, in exchange for simplification, for portability, or for any reason at all, really.)


Is there a "both" option?

My impression and vote, fwiw, would be "sure!"

The stuff I care about is getting a union of lowerdirs, and an upperdir. To me, basically everything else is optional.

So, for example, if gvisor parsed the OCI mount spec and decided "overlay" means the abstract behavior, and defaulted to using the gvisor implementation... I'd be fine with that. If there was then a special option for "usehost", and that's there to support people who really definitely want the host overlay logic, I'd also be fine with that. There would be other ways to tuck in options that would also seem fine.

Having "close" compatibility to a spec that would also drop-in work in runc is nice to have, but I think some slippage in order to have good easily-portable/consistent default behavior, AND have a way to unlock more host specific behavior at your option, is probably a fine trade.


Sorry for the long post, thanks for your attention!

warpfork avatar Dec 19 '23 13:12 warpfork

Your understanding of the two approaches is correct.

Approach B, relying on the host's overlayfs, makes runsc as a whole somewhat more subject to any potential quirks of the host kernel's overlayfs feature.

I'd actually love it if gvisor could present a more consistent experience than the host kernel itself. I'd even accept fairly significant deviations of semantics in order to get that, personally.

Linux compatibility is an important goal for gVisor. We aim to be a lift-and-shift container sandboxing solution. Ideally, an application should be able to run unmodified inside a gVisor sandbox. Most applications running on gVisor (like on Cloud Run) are developed and tested outside of gVisor (on Linux systems), and users then deploy them in gVisor (e.g. deploying in Cloud Run with the first generation execution environment). Hence, it is important that gVisor remains compatible with Linux. We are constantly working on closing this compatibility gap with Linux. So deviating from Linux behavior might be an anti-goal.

Furthermore, if the said rootless container was to be executed without gVisor, then runc would configure the container overlayfs using host's overlayfs. So Approach B seems more consistent with runc AND easier to implement.

overlayfs sometimes needs an option of "xuserattrs" in order to work correctly in rootless containers

I cannot find xuserattrs in the upstream kernel: https://elixir.bootlin.com/linux/latest/A/ident/xuserattrs. Could you point to documentation of this option?

overlayfs fails to work (in very poorly reported ways!) on some combinations of host filesystems.

I see. I think it is important to understand exactly why, and under what conditions, a tmpfs layer would fail to mount (is it a bug or intentional?), and subsequently whether it is possible to avoid such failures in gVisor without breaking Linux compatibility.

Sometimes overlayfs creates a mount, and it largely appears to work... Until you try to remove a file.

Again, I think we need to understand why exactly Linux is failing the "remove file" operation (bug or WAI?). Maybe it is an issue with how overlayfs is being configured. I can try to look if you have a reproducer for this.

Those are some very interesting choices that the kernel's overlayfs implementation made... and I don't think most users care. At the same time, I suspect a lot of these odd edge cases, quirks, and incredibly-unfortunate-incomposabilities in the kernel's overlayfs come down to some of those very same choices in implementation detail. So, making different choices in a gvisor overlay system that's semantic, rather than quirk-for-quirk identical -- I strongly suspect would probably be acceptable to users. Another thing the kernel's overlayfs spends a great deal of attention on, where I think many users could also care less, are the st_ino, st_dev, and d_ino numbers in stat and readdir.

We repeatedly get to see Hyrum's Law at play. It is surprising to see how many applications depend on small implementation details in Linux, and gVisor has to mimic that behavior for compatibility, so that the application can work unmodified in the gVisor environment. Sidestepping inconvenient features and creating seemingly useful deviations might end up biting us later.

Is there a "both" option? My impression and vote, fwiw, would be "sure!"

I think the way to do this would be defaulting to the host overlay solution, and having gVisor-specific overlay mount options that can switch to using our overlay implementation. Additionally, we would require separate options to enable/disable each quirk/feature (defaulting to what Linux does).

ayushr2 avatar Jan 08 '24 17:01 ayushr2

+1 to what @ayushr2 said; generally, gVisor aims for bug-for-bug compatibility with Linux by default, rather than a more-semantically-correct-but-not-actually-identical implementation. (I believe @ayushr2 may have some war stories to share here...)

IMO, the correct layer for making overlayfs less footgun-y would be a FUSE daemon implementing the overlaying logic in userspace, which would solve the problem both in and out of gVisor. But if performance is critical, having a gVisor-specific mount option that the user has to explicitly specify would be OK too, since at that point the user should no longer expect Linux-like behavior.

EtiennePerot avatar Jan 09 '24 18:01 EtiennePerot