bubblewrap bwrap can exit while overlayfs upper directory is still busy, preventing reuse in a subsequent command

Greetings,

I have encountered some issues when using bwrap's native overlay options in contrast to fuse-overlayfs; these issues stem from the lack of a lazy unmount option. The bwrap overlayfs upper directory stays busy for too long, and when bwrap is used in scripts, some sequential commands fail unless there is an explicit delay with sleep. I could not wait with lsof, maybe because the mountpoint can only be seen in bwrap's namespace. Would it be possible to offer lazy unmounting (MNT_DETACH) as an option?

Best,

Nov 14 '24 01:11 ruanformigoni

bwrap doesn't actually unmount anything: its processes just exit, and let the kernel do the cleanup when the number of processes inside the sandbox drops from 1 to 0. This ensures that the same code path is used for graceful exit or a crash, but it means there is nowhere that we could put different unmount options.

So I'm going to repurpose this bug report to be "Bwrap overlayfs upper directory stays busy for too long": the specific solution you're proposing is not available, but perhaps a different solution would be.

At the moment, in a typical use of bwrap with --unshare-pid, we have these processes, where the "monitor" is a direct child of your script:

bwrap monitor
    |
. . | . . . . . . sandbox boundary . . . . .
    |
    \- bwrap init process (pid 1 inside sandbox)
        |
        \- sandbox's main process (pid 2 inside sandbox, the COMMAND from the bwrap CLI)
            |
            \- ... maybe other processes ...

(One of the things on my extensive to-do list is to add a diagram similar to this one to the source code so that we can have a common set of terms!)

From information in other issue reports, I think there might be a race condition where the monitor can exit (returning control to your script) before the init process has completely exited, which means that the overlayfs upper directory is still busy when your script moves on to the next thing that it wants to do. (Or perhaps that race condition doesn't exist; I'm not 100% sure.) If this race condition exists, then solving it would need some sort of synchronization between the init process and the monitor, so that the monitor waits until the init process has completely exited, and the init process in turn waits for all of the other processes to have finished.

A complicating factor is that the bwrap init process doesn't always exist: depending on exactly which options you've used, the COMMAND might be the initial process in the sandbox, in which case I think the monitor will exit as soon as the COMMAND has exited, but there could still be processes in the sandbox which haven't been cleaned up yet. So that might be the issue that you are seeing. If so, you could avoid it by using --lock-file to lock some file, which has the side-effect of creating an init process to hold the lock. Or you could use --sync-fd, but that's probably hard to use from a shell script (although increasingly, my preferred solution to that is "don't write shell scripts").
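For a caller that spawns bwrap from C++ rather than from a shell script, the --lock-file suggestion might be sketched roughly as below. The helper name `build_bwrap_args`, the `--bind / /` setup and the lock path are illustrative assumptions, not something taken from this report; the lock file is assumed to already exist at that path inside the sandbox.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of bwrap): builds an argument vector that
// asks bwrap to hold a lock on lock_path while the sandbox is running.
// Assumption: lock_path exists and is visible inside the sandbox, which is
// why the whole host root is bind-mounted here for simplicity.
std::vector<std::string> build_bwrap_args(
    const std::string& lock_path,
    const std::vector<std::string>& command)
{
    std::vector<std::string> args = {
        "bwrap",
        "--bind", "/", "/",        // illustrative filesystem setup only
        "--lock-file", lock_path,  // side-effect: an init process holds the lock
        "--",                      // end of bwrap options, COMMAND follows
    };
    args.insert(args.end(), command.begin(), command.end());
    return args;
}
```

The resulting vector can be converted to a `char*` array and passed to `execvp()`. Whether the extra init process is enough to keep the upper directory from being busy afterwards is exactly the open question here, so treat this as something to experiment with.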

Nov 14 '24 11:11 smcv

> these issues stem from the lack of a lazy unmount option

I would actually have expected that unmounting the overlayfs lazily would make this issue worse, by having the upper directory be in use for longer than the filesystem is mounted (while taking away your ability to check whether it's still in use).

Nov 14 '24 11:11 smcv

Thanks for the detailed explanation. I suggested the lazy option because it frees up the mount point for reuse (which works with fuse-overlayfs to avoid this issue), but I guess that would not fix the problem here, since it is the upper directory that is busy. A current workaround I use for this, with minimal changes to bwrap itself, is to exit(errno) instead of exit(1) here:

https://github.com/containers/bubblewrap/blob/9ca3b05ec787acfb4b17bed37db5719fa777834f/utils.c#L85-L97

Then, if the exit code is 16 (EBUSY), I can just retry up to n times with a delay between attempts. I also tried the --lock-file option, but that did not seem to create an init process, inside the sandbox the bwrap command itself was it. I'm using bwrap in C++, how would I proceed to use --sync-fd? Do I open the file descriptor outside the sandbox, pass it as an argument and wait for it to close (again outside the sandbox)?
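The retry workaround could look roughly like the sketch below. It assumes the locally patched bwrap described above, which exits with errno (so 16 means EBUSY); an unpatched bwrap exits with 1 on such errors and would never trigger the retry. The name `retry_while_busy` and the `run` callback, which is expected to launch bwrap and return its exit code, are hypothetical.

```cpp
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Calls run() until it returns something other than EBUSY, at most
// max_attempts times, sleeping for `delay` between attempts.
// Returns the exit code of the last attempt (EBUSY if max_attempts <= 0,
// in which case run() is never called).
int retry_while_busy(const std::function<int()>& run,
                     int max_attempts,
                     std::chrono::milliseconds delay)
{
    int code = EBUSY;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        code = run();
        if (code != EBUSY) {
            break;  // success, or a failure that retrying will not fix
        }
        if (attempt + 1 < max_attempts) {
            std::this_thread::sleep_for(delay);
        }
    }
    return code;
}
```

Note that a COMMAND which itself exits with status 16 would also be retried, because the two cases look identical from the outside.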

Finished testing --sync-fd with pipe, fork, dup2 and read. It does work for waiting in the parent process until the sandbox exits, but that does not wait for overlayfs yet. I suspect that if I can get --lock-file to create the init process, it would work. Can the target file for the --lock-file argument be any file? I'm using a file in /tmp that is accessible from both the host and the sandbox.

Nov 15 '24 16:11 ruanformigoni

> that did not seem to create an init process, inside the sandbox the bwrap command itself was it

The process name will still be bwrap; the difference is the number of processes that appear:

If there's an init process, you will see a bwrap (outside the sandbox), with a bwrap child (inside the sandbox), with your COMMAND as a grandchild of the original bwrap.

If not, you will see a bwrap (outside the sandbox) with your COMMAND as a direct child process.

> I'm using bwrap in C++, how would I proceed to use --sync-fd? Do I open the file descriptor outside the sandbox, pass it as an argument and wait for it to close (again outside the sandbox)?

The short version is "look at how Flatpak does it". Flatpak is C code internally, it shouldn't be too far away from your C++.

You open a pipe(), give the writable end to bwrap (and close it in the parent process), keep the readable end for yourself (don't let bwrap inherit it), and monitor the readable end with poll() or similar. When bwrap exits, the write end closes, which means the readable end of the pipe becomes ready for reading (it will reach EOF). Or you could use an eventfd or a socketpair or something like that instead of a pipe if you prefer.
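A minimal C++ sketch of the pipe-based approach, not Flatpak's actual code: the function name `spawn_and_wait_for_fd_close` and the `make_argv` callback are hypothetical, and the callback receives the number of the pipe's write end so that the caller can place it after `--sync-fd`.

```cpp
#include <cerrno>
#include <functional>
#include <string>
#include <vector>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Spawns the program described by make_argv(write_fd), then blocks until
// every process that inherited write_fd has closed it (EOF on the read end).
// Returns the direct child's raw wait status, or -1 on error.
int spawn_and_wait_for_fd_close(
    const std::function<std::vector<std::string>(int)>& make_argv)
{
    int fds[2];
    if (pipe(fds) != 0) return -1;
    // Keep the readable end for ourselves: do not let the child inherit it.
    fcntl(fds[0], F_SETFD, FD_CLOEXEC);

    // Build argv before fork() so the child does not need to allocate.
    std::vector<std::string> args = make_argv(fds[1]);
    if (args.empty()) { close(fds[0]); close(fds[1]); return -1; }
    std::vector<char*> argv;
    for (const std::string& a : args) argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);

    pid_t pid = fork();
    if (pid < 0) { close(fds[0]); close(fds[1]); return -1; }
    if (pid == 0) {
        execvp(argv[0], argv.data());
        _exit(127);  // exec failed
    }
    // The parent must close its copy of the write end, or EOF never arrives.
    close(fds[1]);

    struct pollfd pfd = { fds[0], POLLIN, 0 };
    for (;;) {
        int r = poll(&pfd, 1, -1);
        if (r < 0 && errno == EINTR) continue;
        if (r < 0) break;
        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof buf);
        if (n > 0) continue;                    // ignore any stray data
        if (n < 0 && errno == EINTR) continue;
        break;                                  // n == 0 is EOF: writers are gone
    }
    close(fds[0]);

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return -1;
    }
    return status;
}
```

With bwrap, the callback might return something like `{"bwrap", "--bind", "/", "/", "--sync-fd", std::to_string(fd), "--", "true"}`, where the `--bind / /` part is only an illustrative setup. This tells you when the processes holding the fd have gone away; it does not by itself guarantee that the kernel has finished releasing the overlayfs upper directory.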

Nov 15 '24 18:11 smcv

> exit(errno) instead of exit(1) here

Sorry, we're unlikely to apply that change. Having bwrap exit with exit status 16 (or whatever) is indistinguishable from what happens when the user-specified COMMAND exits with status 16.

It's already not ideal that you can't tell whether bwrap ... -- false exited with status 1 because bwrap itself failed and exited with status 1, or because false exited with status 1 like it's designed to. If we had a time machine, bwrap would probably exit 125 on internal errors, like env(1) does.

Nov 15 '24 18:11 smcv