bubblewrap
bubblewrap copied to clipboard
O(N^2) performance in the number of "--bind" flags
When bwrap is called with thousands of "--bind" arguments on the command line, it slows down considerably. It looks like /proc/self/mountinfo is parsed for every single "--bind" flag. Must that happen? Can the code instead parse the mount table once and update its representation over time?
Some more information ....
As bwrap processes each "--bind" flags it parses /proc/self/mountinfo. After every bind mount that file is longer by one line ... that's what leads to N^2 behavior.
I ran a test where I pass bwrap 2000 "--bind" flags ... bwrap --dev-bind / / --bind /usr /tmp/1 --bind /usr /tmp/2 ... --bind /usr /tmp/2000 /bin/true
The program takes 6 or 7 seconds to run. If I hack the code to re-use information collected from parsing /proc/self/mountinfo, the same command takes about 150 milliseconds to run.
You could open a PR, so your change could (presumably) be subject to review.
Here is a pull request I cooked up to solve the problem. It likely has some bugs. I'd like to work with the bubblewrap maintainers to land it. Thanks!.
I have updated the pull request: https://github.com/containers/bubblewrap/pull/385 I've done some testing, so it is now much more likely to be correct.
It kinda does this intentionally, because its pretty hard to know exactly how a recursive bind operation affected the system, especially since other things could be happening in parallel (like e.g. mount propagation). It is particularly sensitive to get this right in the case where bubblewrap is setuid as getting this wrong might be a possible attack vector.
Not saying it is impossible, but it needs to be super careful.
Is there a different strategy that might work to reduce the number of times the mountinfo file is read? I don't fully understand mount propagation and the need to adjust flags for recursive mounts. My use case is a build system, where bubblewrap is used to create a sandbox. It's not unusual to bind mount 2K paths, which takes a really long time with the current implementation.
On Tue, Aug 25, 2020 at 9:04 AM Alexander Larsson [email protected] wrote:
It kinda does this intentionally, because its pretty hard to know exactly how a recursive bind operation affected the system, especially since other things could be happening in parallel (like e.g. mount propagation). It is particularly sensitive to get this right in the case where bubblewrap is setuid as getting this wrong might be a possible attack vector.
Not saying it is impossible, but it needs to be super careful.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/containers/bubblewrap/issues/384#issuecomment-680011802, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAGVMPBBPXWYQBUDS34B33SCOZF5ANCNFSM4P6YNR5A .
cc @kolyshkin - do you have ideas on how to robustly maintain some in-memory idea of what mountinfo looks like without rereading mountinfo on each mount operation?
Ahh, what a nice case.
I guess the problem here is
- every bind mount might result in multiple changes in mountinfo (depending on mount propagation and recursiveness);
- the order of mounts is important.
So, to avoid re-reading the mountinfo, we have to replicate the kernel logic of propagating the mounts, and this seems like being too much. Currently, though, it's the only way to go.
The only other thing I can think of is trying to optimize the parser, and maybe finding some specific cases in which we can avoid re-reading mountinfo.
From the perspective of a user,
It's not unusual to bind mount 2K paths
it might make sense to figure out a way to decrease the number of bind mounts.
Something that might help here: in our use case, we know that each directory being bind-mounted has no nested mounts inside it, and we don't care about propagation to/from it (i.e., it would be fine to unconditionally mark them as private).
Could we make bind_mount
detect whether something being mounted in has any current nested mounts, and avoid rereading mountinfo if so? That way we know we don't have to do any calculation on our own that affects things when doing a series of such mounts.
Or perhaps a separate option --bind-nonrecursive
(which fails if there's anything inside it)?
Could we make bind_mount detect whether something being mounted in has any current nested mounts
funny, but such detection requires reading mountinfo, so it does not make sense (reading mountinfo to avoid reading mountinfo).
Or perhaps a separate option --bind-nonrecursive (which fails if there's anything inside it)?
It seems like in case of non-recursive bind mount we can avoid reading mountinfo, as the only thing we need is flags and it can be obtained via statfs(2).
Yes, but you can do it once - i.e. if you have bwrap --bind /1 /1 ... --bind /2000 /2000
, you can read mountinfo once, determine that 1 through 2000 are all non-nested, and then bind-mount them all without rereading mountinfo.
Also I think you can detect it by trying the mount without MS_REC and seeing if it works, right? It will fail if the mount has other mounts inside it and you aren't privileged over the original mount namespace, so potentially we could do something like
for (...) {
if (mount(source, target, NULL, MS_BIND | MS_PRIVATE, NULL) == 0)
continue;
if (errno == EINVAL) {
if (mount(source, target, NULL, MS_BIND | MS_REC | MS_PRIVATE, NULL) == 0) {
reread_mountinfo();
continue;
}
}
return -1;
}
I believe at this point in the code, since we've made a new mount namespace, we know we're not privileged over the original one.
I was also surprised skimming through the code that every bind_mount
call goes over mountinfo.
My use case is as follows: I want to replace --dev-bind / /
with --dev-bind /bin /bin --dev-bind /usr /usr ...
for every dir entry of /
, so that I can bind a new path --bind ./hello-world /hello-world
at the root of my filesystem.
It'd be nice if there was a flag like --dev-bind-dir-entries / / --bind ./hello-world /hello-world
or so, that does this efficiently
Created pull requests with possible solution: #629 #630 Approach is similar, data structures are little bit different