SwiftTerm
SwiftTerm copied to clipboard
LocalProcess IO hangs with successive writes
Context
I'm working on an application that uses HeadlessTerminal and LocalProcess from SwiftTerm to run shell commands on behalf of a user. We need to be able to capture output of a command and execute several commands in parallel. To do this, we write commands to the terminal and redirect their output to a temp file, backgrounding the commands if necessary to parallelize.
The Bug
Intermittently, the pseudoterminal we are writing to will 'hang' and stop being able to send any new writes. This is what the logs look like when the debugIO property of LocalProcess is enabled:
[SEND-138] Queuing data to client: [40, 32, ... ]
[SEND-138] completed bytes=30495
[SEND-139] Queuing data to client: [40, 32, ... ]
[SEND-139] completed bytes=30495
[SEND-141] Queuing data to client: [40, 32, ... ]
[SEND-142] Queuing data to client: [40, 32, ... ]
[SEND-143] Queuing data to client: [40, 32, ... ]
[SEND-144] Queuing data to client: [40, 32, ... ]
...
The logging comes from here: https://github.com/migueldeicaza/SwiftTerm/blob/29e132a363bbc2652a4a672f7747123306ecb5e9/Sources/SwiftTerm/LocalProcess.swift#L86-L111
The issue starts with the Send-141 write, which will be queued using DispatchIO.write but the corresponding completion callback will never be run. And from this point forward, no other writes or reads will complete.
Our current understanding is that this behavior is triggered by running many writes in quick succession.
I've create a minimal reproducible example of the behavior in this repo: https://github.com/withfig/swiftterm-write-example
At a first glance, @krzyzanowskim suggested there's some thread unsafety from running the LocalProcess
writes on the concurrent DispatchQueue.global
queue. Unfortunately, swapping this out for a serial queue (see @krzyzanowskim's patch here) didn't remedy the error entirely, and the minimal example above still suffers the same issues even with the serial queue
Thank you for this great reproduction case.
Hello,
It looks like the problem is that DispatchIO.read
and write
are convenience functions that mess with the descriptor settings behind our back, and the interleaved operations might be trampling over these. It looks like these will not play well with bidirectional sockets:
After calling this method, the system takes control of the specified file descriptor until the handler block is enqueued. While it controls the file descriptor, the system may modify it on behalf of the application. For example, the system typically adds the O_NONBLOCK flag to ensure that any operations are non-blocking. During that time, it is an error for your application to modify the file descriptor directly. However, you may pass the file descriptor to this method or the read(fd:length:queue:handler:) function. You may also use the file descriptor to create a new dispatch I/O channel. The system relinquishes control of the file descriptor before calling your handler, so it is safe to modify the file descriptor again from your handler code.
I have a new variation of the LocalProcess that creates dedicated channels, and it seems to address these problems. I also use explicit queues, can you try it?
https://gist.github.com/migueldeicaza/73e36c2c98b0427f3197622bdcb3a801
@krzyzanowskim
Hello,
It looks like the problem is that
DispatchIO.read
andwrite
are convenience functions that mess with the descriptor settings behind our back, and the interleaved operations might be trampling over these. It looks like these will not play well with bidirectional sockets:After calling this method, the system takes control of the specified file descriptor until the handler block is enqueued. While it controls the file descriptor, the system may modify it on behalf of the application. For example, the system typically adds the O_NONBLOCK flag to ensure that any operations are non-blocking. During that time, it is an error for your application to modify the file descriptor directly. However, you may pass the file descriptor to this method or the read(fd:length:queue:handler:) function. You may also use the file descriptor to create a new dispatch I/O channel. The system relinquishes control of the file descriptor before calling your handler, so it is safe to modify the file descriptor again from your handler code.
I have a new variation of the LocalProcess that creates dedicated channels, and it seems to address these problems. I also use explicit queues, can you try it?
https://gist.github.com/migueldeicaza/73e36c2c98b0427f3197622bdcb3a801
@krzyzanowskim
You may want to use a lock GCD group around the ptr
in this code, I think you may race the read/write buffer as-is.
Thanks! This gist seems to fix the problem of hanging writes š
Successive calls to LocalProcess send
are still executed asynchronously however, which for large numbers of writes can result in interspersed writes so instead of writing echo hello > /tmp/150.txt\n
and then echo hello > /tmp/151.txt\n
we sometimes end up with something like echo hello > /tmp/150.txtecho\nhello > /tmp/151.txt\n
I imagine in most cases it is desirable for writes to complete before writing more data.
This behavior using your gist is available on this branch of the example repo: https://github.com/withfig/swiftterm-write-example/tree/channeled-local-process
Because of the queues used, the print statements for send completion do not show up but I imagine if they did they would not always appear in the same order in which the bytes to send were queued.
Very good observation on the race, it took me a while to see it, but it is right there in front of my eyes!
This is the current patch I have, I am in the process of downloading a working Xcode to test:
https://gist.github.com/migueldeicaza/0bb69eacc777a626dc0ba1a23d2e5421
Ok that patch is not qutie ready: the DispatchWrite does the write, but never invokes the callback.
That's because I did not understand the purpose of the callback.
Ok, couple of updates.
- The multi-queue approach for some reason causes control-c on the MacTerminal to not be delivered immediately under load (for example, running ls -lR /, and this causes the control-c to take a long time before it responds. Can take minutes to do so.
https://gist.github.com/migueldeicaza/4c8fb6497a3a06d10739f0401de1d6f9
This patch is for swifterm
- I tried going without the multi-queue approach, and just using a DispatchQueue just for reading and instead handling writers directly over the socket. This poses the problem that somehow the child process in the testcase hangs at around 8192 bytes of output produced - even if I have read all the data, so something is putting data that I am not reading out.
https://gist.github.com/migueldeicaza/773f6c79906de6dd9472cbc9b281719f
this patch is against the test case, to easily try out different versions of LocalProcess.
I am strangely invested in this ticket. Iād be happy to debug tomorrow when I get a chance. š
Not kidding though, I had a full day remote coding interview and during my lunch break I started reading through the gist for the bug. I think I spotted a couple actually. Will get back.
There is now a bug fix on main