Using one process's output as another's input creates challenges with non-blocking status
When launching piped subprocesses, trio provides the stdin of the second process (at least; I have not checked whether other descriptors are affected) with O_NONBLOCK set, whereas the shell and Python's subprocess.Popen() do not (on a Linux platform).
See attachment pipe_nonblock.zip for the scripts used.
$ echo foo | ./desc_stdin.py
stdin F_GETFL 0
$ ./python_launch_pipe.py
p2 stderr: data=b'stdin F_GETFL 0\n'
main end
$ ./trio_launch_pipe.py
p2 stderr: data=b'stdin F_GETFL 2048\n' <-- Here O_NONBLOCK is set
main end
I was expecting that, by default, all of a new process's IOs would be blocking. As a workaround, I will have to explicitly clear the O_NONBLOCK flag in my spawned executable, but I believe this can cause problems with a variety of applications (which expect blocking IO by default).
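For reference, the child-side workaround can be sketched with os.set_blocking. This is a stand-alone demonstration on a fresh pipe standing in for the inherited stdin, not trio's actual behavior:

```python
import os

# Hypothetical defensive fix-up for a spawned program that expects
# blocking stdio: clear O_NONBLOCK on the inherited fd at startup.
# A fresh pipe stands in for the real stdin here.
r, w = os.pipe()
os.set_blocking(r, False)      # simulate inheriting a non-blocking fd
assert os.get_blocking(r) is False
os.set_blocking(r, True)       # what the spawned exe would do for fd 0
assert os.get_blocking(r) is True
os.close(r)
os.close(w)
```

In the real program the fix-up would be `os.set_blocking(sys.stdin.fileno(), True)` as the first thing in `main`.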
Huh, that's an interesting case! I'm really not sure what the correct behavior here is.
What makes it tricky is that:

- The parent process holds onto a reference to the pipe that it passes in to the second child, so in principle it could continue to read/write from it as well via its own FdStream object, and that FdStream requires that the fd be in non-blocking mode.
- In principle, you might want to explicitly set an fd to non-blocking before passing it in. This would only happen in a super-exotic case, but in general we do want to make super-exotic cases at least possible to handle. So maybe it would be bad to unconditionally remove the O_NONBLOCK flag when spawning a process?
One option: have a special case where if you pass in an FdStream object as a new process's stdin/stdout/stderr, then open_process sets it to blocking + closes the FdStream in the parent process. (And if someone wants the super-exotic case of passing in the fd in raw non-blocking mode, then they can pass a raw file descriptor instead.)
This feels... odd, but also convenient.
...That might also let us simplify the fd creation code a bit, because stdin=PIPE would become equivalent to a, b = make_pipe(); stdin=b.
When you say stdin=p1.stdout when constructing p2, that makes p2's stdin use the same open file description as p1's stdout. p1's stdout is also still available to be accessed as a Trio stream, for which purpose it must be in non-blocking mode. The non-blocking status is a property of the open file description, not of the file descriptor, so it's the same for Trio and for the child process. Result: one or the other is going to be unhappy.
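The shared-flag behavior is easy to demonstrate with a dup'd pipe fd standing in for the descriptor the child inherits:

```python
import os

# Minimal demonstration that O_NONBLOCK lives on the open file
# description, not the descriptor: a dup'd fd sees the same flag.
r, w = os.pipe()
r2 = os.dup(r)                         # second fd, same open file description
os.set_blocking(r, False)              # set O_NONBLOCK through the first fd
assert os.get_blocking(r2) is False    # visible through the duplicate too
os.set_blocking(r2, True)              # clearing it through the duplicate...
assert os.get_blocking(r) is True      # ...affects the original as well
for fd in (r, r2, w):
    os.close(fd)
```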
I'm not sure how best to resolve this. As a workaround, you could use PIPE for both processes and spawn a Trio task to shovel data from one to the other. That's somewhat clumsy, though. Thoughts on a real solution...
- On Linux we can open /proc/self/fd/N to get another open file description for the same pipe, with independent non-blockingness, but that doesn't work on other systems.
- We could use socketpairs for process I/O instead of pipes (thus allowing individual calls to use MSG_NOWAIT), but that's an unusual choice and might confuse some subprocesses.
- We could provide some way to get our end of the pipe to not be put in nonblocking mode at all. Like stdout=trio.BLOCKING_PIPE instead of subprocess.PIPE or something. That's kind of clumsy though.
- We could detect the case where a trio FdStream is passed as one of the I/O descriptors for a child process, switch the underlying FD back to blocking mode, and poison the stream against future uses within Trio (with a helpful error message explaining the situation). That's a little magical, but it would provide the best UI for people who expect their subprocess patterns to translate over unmodified.
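The PIPE-plus-shovel workaround mentioned above can be sketched with the stdlib, with a thread standing in for the Trio task (under trio, the shovel would be a nursery task copying between the two streams):

```python
import subprocess
import threading

# Sketch of the PIPE-plus-shovel workaround: each child gets its own
# pipe, and the parent copies bytes between them, so neither child
# ever sees an fd that trio has put into non-blocking mode.
p1 = subprocess.Popen(["echo", "foo"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["cat"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def shovel():
    # copy p1's output into p2's input, then close to signal EOF
    for chunk in iter(lambda: p1.stdout.read(4096), b""):
        p2.stdin.write(chunk)
    p2.stdin.close()

t = threading.Thread(target=shovel)
t.start()
out = p2.stdout.read()
t.join()
p1.wait()
p2.wait()
print(out)
```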
@njsmith, thoughts?
Code so others don't have to unpack the zip file:
trio_launch_pipe.py:
#! /usr/bin/python3
import subprocess

import trio


async def main():
    """Main."""
    cmd1 = ("echo", "foo")
    cmd2 = ("./desc_stdin.py",)
    async with await trio.open_process(cmd1, stdout=subprocess.PIPE) as p1:
        async with await trio.open_process(
            cmd2, stdin=p1.stdout, stderr=subprocess.PIPE, stdout=subprocess.DEVNULL
        ) as p2:
            async with p2.stderr:
                async for data in p2.stderr:
                    print(f"p2 stderr: {data=}")
    print("main end")


if __name__ == "__main__":
    trio.run(main)
desc_stdin.py:
#! /usr/bin/python3
import fcntl
import sys
print("stdin F_GETFL", fcntl.fcntl(0, fcntl.F_GETFL), file=sys.stderr)
Huh, it seems like we independently converged on the same solution. :-) Probably special handling for FdStream arguments is the way to go then!
Portability is an interesting point... on Windows this problem doesn't happen at all, and on Linux like you note there's a simple workaround (modulo any exotic environments where /dev and /proc are inaccessible). Does macOS support MSG_NOWAIT? I feel like I looked into this at some point and it might not exist there?
It's not documented but it does seem to work:
% python3.8
Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket; socket.MSG_DONTWAIT
<MsgFlag.MSG_DONTWAIT: 128>
>>> a, b = socket.socketpair()
>>> b.recv(128)
^CTraceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyboardInterrupt
>>> b.recv(128, socket.MSG_DONTWAIT)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
BlockingIOError: [Errno 35] Resource temporarily unavailable
>>>
Following your discussion, os.set_blocking(p1.stdout.fileno(), True) between the two trio.open_process calls can be used as a workaround.
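The effect of that workaround on what the child observes can be shown at the fd level. In this stand-alone sketch, os.set_blocking(r, False) simulates what trio's FdStream does to its end of the pipe:

```python
import os
import subprocess
import sys

def child_stdin_blocking(fd):
    # spawn a child whose stdin is `fd` and report what the child sees
    p = subprocess.run(
        [sys.executable, "-c", "import os; print(os.get_blocking(0))"],
        stdin=fd, capture_output=True, text=True)
    return p.stdout.strip()

r, w = os.pipe()
os.set_blocking(r, False)            # what trio's FdStream does to its fds
print(child_stdin_blocking(r))       # child inherits O_NONBLOCK -> "False"
os.set_blocking(r, True)             # the workaround from this comment
print(child_stdin_blocking(r))       # now the child sees blocking stdin -> "True"
os.close(r)
os.close(w)
```

Because the child's fd 0 shares the open file description with `r`, clearing the flag in the parent before spawning is enough.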
Using socket.socketpair() instead of a pipe may be a problem for some specific applications (for example, the splice system call may not work).
@mic006
> Following your discussion, os.set_blocking(p1.stdout.fileno(), True) between the 2 trio.open_process can be used as a workaround.
Right, that's a good workaround for right now, but (1) it's awkward to force users to do that, and (2) if you then try to access p1.stdout from the parent, the parent will lock up, because Trio will try to access the fd in a non-blocking manner while the fd is in blocking mode. That's not a problem for your use case, because you're not going to use p1.stdout in the parent, but it would be nice if we could figure out an API design that didn't leave this footgun lying around.
> Using socket.socketpair() instead of a pipe may be a problem for some specific applications (splice system call may not work for example).
splice should be OK, because we'd only need to use socketpair on macOS/BSDs, and those don't have splice :-). But yeah, in general I'm wary. Using a socket instead of a pipe should just work, but it's not a common thing to do, so we might well discover a year later that there's some quirky program that objects for some obscure reason, and then have to start over.
I guess one hacky but workable option would be to put p1.stdout into a mode where it toggles O_NONBLOCK on-and-off-again every time the parent process tries to use it.
Some other cases worth at least thinking about:
- Piping multiple processes into one output process: (foo & bar) | baz
- Once we have stdio support in the parent (#174), we will want to think about what happens if you pass those explicitly to a child process (instead of relying on implicit inheritance).
> I guess one hacky but workable option would be to put p1.stdout into a mode where it toggles O_NONBLOCK on-and-off-again every time the parent process tries to use it.
...and we could probably use the same code for #174, now that I think about it.
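A rough fd-level sketch of that toggle-per-use idea (the helper name is hypothetical, not trio API): flip the fd into non-blocking mode only for the duration of each parent-side operation, then restore it, so a child sharing the open file description always sees blocking mode at rest.

```python
import os

def read_nonblocking(fd, n):
    # Hypothetical helper: perform one non-blocking read on an fd that
    # normally stays in blocking mode, restoring the flag afterwards.
    was_blocking = os.get_blocking(fd)
    os.set_blocking(fd, False)
    try:
        return os.read(fd, n)
    except BlockingIOError:
        return None                   # nothing available right now
    finally:
        os.set_blocking(fd, was_blocking)

r, w = os.pipe()                      # r stays blocking at rest
assert read_nonblocking(r, 16) is None       # empty pipe: would have blocked
os.write(w, b"hi")
assert read_nonblocking(r, 16) == b"hi"
assert os.get_blocking(r) is True            # flag restored after each call
os.close(r)
os.close(w)
```

The obvious downside is two extra fcntl calls per operation, plus a race if anything else touches the flag concurrently.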
Does it even make sense to keep stdin or stdout open in the Python process once it has been used to spawn another process? To make the piping work properly, I need to explicitly call aclose() to have the pipe closed. Somehow, once a second process has been started with the pipe of a previous one, the Python process loses responsibility for it, no? Or are there more complex use cases where keeping it can be useful?
We ran into this at $work today. In our situation, explicitly setting the pipe back to blocking mode helped, as would the "trio magically sets the stream back to blocking mode and turns it invalid" converged-solution. So my vote would be for that approach.
Another possible idea would be to add some special constant to trio (akin to PIPE) that specifically evoked this magic (like stdout=trio.PIPE_NOT_FOR_USE_IN_TRIO_BUT_ANOTHER_SUBPROCESS_LATER but ... better named). Or the converse, for cases where people do actually want it to stay accessible in trio: stdout=trio.PIPE_WITH_FDSTREAM.