ip2unix icon indicating copy to clipboard operation
ip2unix copied to clipboard

Socket gets unlinked when children closes the socket file descriptor

Open aszlig opened this issue 4 years ago • 11 comments

strace output provided by @riedel in https://github.com/nixcloud/ip2unix/pull/13#issuecomment-635532573:

https://gist.github.com/riedel/043936e3c80da0f2607d48d8205d63e0#file-rsession-strace-L3960

My answer:

Okay, so the interesting part here is that it happens prior to executing /bin/sh -c 'git "--version"', where all file descriptors are closed. However, since after the clone the file descriptor is still active in the parent, we should not unlink the socket file.

To fix this, we need to implement some kind of reference counting... oh geesh...

Originally posted by @aszlig in https://github.com/nixcloud/ip2unix/pull/13#issuecomment-635540831

aszlig avatar May 29 '20 02:05 aszlig

Some more thoughts on this:

The naive way to solve this would be to just wrap clone/clone3 (except if using CLONE_FILES or CLONE_VM), daemon and fork/vfork and make sure the the socket file descriptors controlled by ip2unix aren't unlinked when close is called in the child process.

This will probably work with the rsession case mentioned/reported above, since the clone call is followed up by execve (after closing the file descriptors ranging from 3 to 4095) and thus the socket is not really being used in the child process.

However, this falls short as soon as the child process is actively using the socket and it's closed in the parent, for example (in pseudocode):

fd = socket(...);
pid = fork();
if (pid == 0) { // children
    accept(fd, ...);
    ...
}
// parent
close(fd);
waitpid(pid, ...);

In this case, the socket would be unlinked in the parent way too early and would leave users baffled on why there is no socket file. Note that unlinking the socket doesn't cause any errors in the socket interface and it's a bit similar to our own blackhole rule action.

We could get around this by having a side-channel which keeps track of the processes using a socket controlled by us. One very error-prone, non-portable and racy way would be by checking /proc/PID/net/unix of the parent and/or other processes in the same pgroup. Another way would be to fork off a dedicated ip2unix process that's solely responsible for keeping track of socket file descriptors.

Both however doesn't sound like the right solution here and I'd rather have a more simple solution where we even wouldn't need to wrap fork or clone.

Another way would be to not unlink socket files at all or even add an option to allow the user to control the behaviour. Unfortunately, especially the former would constitute a backwards-incompatible change and I personally would also prefer not to have countless of old socket files laying around.

aszlig avatar Jun 01 '20 00:06 aszlig

I was trying to reproduce the bug reliably on centos:7 with glibc-2.17-307.el7.1.x86_64 using docker. One really needs to install git for the problem to appear (so it gets executed with all the sockets closed)

FROM centos:7
RUN yum install -y epel-release centos-release-scl
RUN yum install -y meson yaml-cpp-devel python3-pytest R devtoolset-7-gcc-c++ git
RUN yum install -y https://download2.rstudio.org/server/centos6/x86_64/rstudio-server-rhel-1.3.959-x86_64.rpm
COPY . /ip2unix
WORKDIR /ip2unix
RUN scl enable devtoolset-7 'meson build && ninja -C build && ninja -C build install'
#RUN ninja -C build test
RUN adduser testuser
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib64
USER testuser
RUN R_HOME=/usr/lib64/R  R_SHARE_DIR=/usr/share/R R_INCLUDE_DIR=/usr/include/R R_DOC_DIR=/usr/share/doc/R-3.6.0 RSTUDIO_DEFAULT_R_VERSION_HOME=/usr/lib64/R  RSTUDIO_DEFAULT_R_VERSION=3.6.0 timeout 10 ip2unix -r path=/tmp/rsession.sock /usr/lib/rstudio-server/bin/rsession --standalone=1 --program-mode=server --log-stderr=1 --www-address=127.0.0.1 --www-port=12345 & sleep 3 && curl --unix-socket /tmp/rsession.sock http://127.0.0.1:12345

riedel avatar Jun 01 '20 13:06 riedel

@riedel: Just to make sure my assumptions are correct:

If you uncomment the following line:

https://github.com/nixcloud/ip2unix/blob/b54905a0699d2ff3f24d9e219655a3ad7c9b3d85/src/socket.cc#L352

... does it work?

aszlig avatar Jun 02 '20 11:06 aszlig

... does it work? yes!

riedel avatar Jun 02 '20 23:06 riedel

Comment moved to https://github.com/nixcloud/ip2unix/pull/23.

aszlig avatar Jul 13 '20 03:07 aszlig

... does it work? yes!

Hi, I can report that I got bitten by this bug and the fix above works for me too. Please please consider adding a command-line switch to enable this behavior for users who need it! Leaking socket inodes is better than not working at all.

Specifically, firefox has a low-level debugger protocol called "marionette" which only knows how to listen on TCP sockets. This is a lower-level and more powerful protocol than "chrome devtools" or "webdriver" (in fact, those are implemented using marionette). I don't want to leave such a powerful debugger interface to every random process on the same machine as the browser (including, potentially, website javascripts running inside the browser itself!) and ip2unix does a great job at that -- but it needs this patch. Details: https://bugzilla.mozilla.org/show_bug.cgi?id=1525126#c71

ghost avatar Nov 22 '21 03:11 ghost

Please please consider adding a command-line switch to enable this behavior for users who need it! Leaking socket inodes is better than not working at all.

The issue is that adding a command line switch doesn't necessarily make the problem go away, because people do not necessarily know that they need such a switch, so I think we need to have a better interim solution.

One that I could think of would be to unlink the socket file during bind instead of close, which would avoid some shady command line flag that we need to deprecate in version 3. We already do this whenever SO_REUSEADDR is set for the socket.

Long-term however I think we need to switch to fork+execve+wait instead of just execve right now, which we need to do anyway if we want to use seccomp BPF. This model would leave ip2unix running for the whole runtime of the program in question but the above issue would not apply because we can also track child processes.

aszlig avatar Nov 22 '21 13:11 aszlig

On Mon, 22 Nov 2021 05:44:35 -0800 aszlig @.***> wrote:

This model would leave ip2unix running for the whole runtime of the program in question

That would be a very different piece of software than the current ip2unix. Not having an extra PID floating around is a big motivation to use ip2unix instead of socat.

If you do switch to using an auxiliary daemon PID, please consider giving the new implementation a new name, or at the very least rename the existing software "old-ip2unix". Not everybody will want to switch to new-ip2unix.

The issue is that adding a command line switch doesn't necessarily make the problem go away, because people do not necessarily know that they need such a switch, so I think we need to have a better interim solution.

It simply isn't possible for old-ip2unix to unlink socket inodes automatically, correctly, in all cases. Only the kernel can do this sort of "when last dup()ed fd is closed, garbage collect" thing -- just like Linux does in its non-POSIX "abstract namespace" for AF_UNIX sockets.

Here is the case you can't handle: a server bind()s a socket, then fork()s, and then accept()s in both the parent and child process after the fork(). This is legitimate, and I'm sure somebody somewhere is using it to get a crude form of multiprocess load-balancing.

Since automatic and correct cleanup of socket inodes isn't a feature that old-ip2unix can guarantee, the solution is simple: just stop trying to guarantee it! Don't attempt any cleanup. Users are asking for these inodes to be created (it is the person invoking old-ip2unix who picks the socket name!), so they know these inodes are getting created, and therefore they know something needs to clean them up. Stop telling them that old-ip2unix will be that "something" for them. Users can add an extra line to their shell scripts to do that cleanup. How and when they do this can be application-specific.

If you want old-ip2unix to do the cleanup under certain limited conditions, offer those heuristic cleanups as non-default options:

--unlink-after-bind --unlink-after-first-accept --unlink-after-close-in-same-pid-that-called-bind --unlink-on-fridays-in-september-with-a-full-moon ...etc

If users don't discover these non-default options it's okay! Users adding these options take on the burden of making sure their use case matches the unlinking heuristic they have selected.

It's really just a perspective shift. Automatic cleanup can't be guaranteed correctly, so don't try.

  • a

ghost avatar Nov 25 '21 11:11 ghost

That would be a very different piece of software than the current ip2unix. Not having an extra PID floating around is a big motivation to use ip2unix instead of socat.

So what's the practical implication in your case?

I mean, I don't like it either, but if I think about my use cases, I can't find a single occasion where this would be an issue. To the contrary: With seccomp BPF, it works for statically linked programs as well.

If you do switch to using an auxiliary daemon PID, please consider giving the new implementation a new name, or at the very least rename the existing software "old-ip2unix". Not everybody will want to switch to new-ip2unix.

Well, that's what a new major version is for: It's simply version 3. People who still want the old behaviour can still use version 2.

Here is the case you can't handle: a server bind()s a socket, then fork()s, and then accept()s in both the parent and child process after the fork().

Why do we need to handle this? At the moment, the wrapper for accept only rewrites the peer address based on SO_PEERCRED, so it ~~shouldn't~~doesn't make any difference.

aszlig avatar Nov 25 '21 13:11 aszlig

On Thu, 25 Nov 2021 05:44:15 -0800 aszlig @.***> wrote:

Here is the case you can't handle: a server bind()s a socket, then fork()s, and then accept()s in both the parent and child process after the fork().

Why do we need to handle this?

In order to correctly emulate the behavior of AF_INET.

That is, after all, the goal of ip2unix, isn't it?

AF_INET will keep accepting connections when one of the two processes does a close() while the other is still in its accept() loop. You can't accept new connections on an AF_UNIX socket after a helpful cleanup heuristic obliterates the socket inode.

That would be a very different piece of software than the current ip2unix. Not having an extra PID floating around is a big motivation to use ip2unix instead of socat.

So what's the practical implication in your case?

Reliability.

Every time you add another layer of supervisor processes you multiply the number of failure modes. The obvious ones: child dies, parent doesn't; parent dies, child doesn't; child double-forks in order to deparent itself (do you plan on handling this? it is not easy) like how gpg "helpfully" double-forks gpg-agent and leaves it loitering. There are plenty more non-obvious failure modes.

You will be trying to add process supervision to programs that weren't designed with supervision in mind. This is unreliable, and it's why a lot of daemons had to have code added to disable double-forking and similar behaviors in order to get them to run under daemontools, runit, and s6. Later on systemd benefited greatly from all that upstreaming work that had already been done, nagging daemon-writers saying "hey, could you add a flag to make that double-fork() behavior not happen?".

And if the process running under ip2unix was designed for supervision, what if it needs to be run within some other kind of process supervision framework? Are you going to forward signals? Handle process groups and session groups correctly? Debugger attachments? All the things that make "su" and "sudo" more complex and less bulletproof than "chpst".

The unfortunate reality is that UNIX just doesn't have a universal way to know when a child process (including the child's children) is "done". You can't create wrappers for UNIX programs that rely on knowing when the wrapped program is "done" that work in the general case, because there's no way to know in general.

If you do switch to using an auxiliary daemon PID, please consider giving the new implementation a new name, or at the very least rename the existing software "old-ip2unix". Not everybody will want to switch to new-ip2unix.

Well, that's what a new major version is for: It's simply version 3. People who still want the old behaviour can still use version 2.

Setting aside for a moment the headaches this creates for distribution maintainers...

Once you've released your ip2unix-v3 and declared your intent to stop maintaining ip2unix, I assume you won't be the least bit offended if others pick up the now-abandoned-by-you fork and carry it forward. For example, removing the socket-inode-deletion heuristic. Of course since it is ip2unix, they'll call it ip2unix. And someday they will decide it's time to increment their version number...

ghost avatar Nov 29 '21 10:11 ghost

AF_INET will keep accepting connections when one of the two processes does a close() while the other is still in its accept() loop. You can't accept new connections on an AF_UNIX socket after a helpful cleanup heuristic obliterates the socket inode.

True but that's this very issue. Whether one would do accept on both processes or not would result in exactly the same issue and you don't need to get too theoretical in order to reproduce this. See the test case I mentioned earlier and also the test case I added in https://github.com/nixcloud/ip2unix/pull/23 for a more practical example modeled after a real-world issue.

Reliability.

This is something you can't have with LD_PRELOAD either and it's something that's even more intrusive and error-prone than having an extra process because you're actively sitting around in the program's memory. Not to mention all the corner cases that you need to take care of, eg. if you're running multiple versions of libc.

So the status quo already is a trade-off between reliability and barely working or even not working at all.

Adding the process mentioned earlier will move all the state handling code from the target program to the supervision process, so the amount of corner cases we need to handle will decrease rather than increase.

Side note when it comes to double-forking and all kinds of ways to escape the supervision: Since the goal is to switch to seccomp BPF for GNU/Linux (with a fallback to LD_PRELOAD) the program in question can't use double-fork to escape.

But speaking of compromises: If we would have a way to reliably detect whether we need a supervision process in the first place, we could do a hybrid approach.

Once you've released your ip2unix-v3 and declared your intent to stop maintaining ip2unix, I assume you won't be the least bit offended if others pick up the now-abandoned-by-you fork and carry it forward. For example, removing the socket-inode-deletion heuristic. Of course since it is ip2unix, they'll call it ip2unix. And someday they will decide it's time to increment their version number...

That's entirely fine and if someone hard-forks the project with the same name, it's up to the distributions to distinguish both variants. Most distributions out there do seem to handle that pretty well (for example consider all those netcat variants).

aszlig avatar Nov 30 '21 02:11 aszlig

Please please consider adding a command-line switch to enable this behavior for users who need it!

The non-solution I went with is basically adding a new flag called "noremove", which prevents the socket path from being unlinked when the socket is closed.

Thank you!

ghost avatar Aug 16 '23 02:08 ghost