SwiftTerm icon indicating copy to clipboard operation
SwiftTerm copied to clipboard

LocalProcess IO hangs with successive writes

Open sullivan-sean opened this issue 2 years ago ā€¢ 8 comments

Context

I'm working on an application that uses HeadlessTerminal and LocalProcess from SwiftTerm to run shell commands on behalf of a user. We need to be able to capture output of a command and execute several commands in parallel. To do this, we write commands to the terminal and redirect their output to a temp file, backgrounding the commands if necessary to parallelize.

The Bug

Intermittently, the pseudoterminal we are writing to will 'hang' and stop being able to send any new writes. This is what the logs look like when the debugIO property of LocalProcess is enabled:

[SEND-138] Queuing data to client: [40, 32, ... ]
[SEND-138] completed bytes=30495
[SEND-139] Queuing data to client: [40, 32, ... ]
[SEND-139] completed bytes=30495
[SEND-141] Queuing data to client: [40, 32, ... ]
[SEND-142] Queuing data to client: [40, 32, ... ]
[SEND-143] Queuing data to client: [40, 32, ... ]
[SEND-144] Queuing data to client: [40, 32, ... ]
...

The logging comes from here: https://github.com/migueldeicaza/SwiftTerm/blob/29e132a363bbc2652a4a672f7747123306ecb5e9/Sources/SwiftTerm/LocalProcess.swift#L86-L111

The issue starts with the Send-141 write, which will be queued using DispatchIO.write but the corresponding completion callback will never be run. And from this point forward, no other writes or reads will complete.

Our current understanding is that this behavior is triggered by running many writes in quick succession.

I've create a minimal reproducible example of the behavior in this repo: https://github.com/withfig/swiftterm-write-example

At a first glance, @krzyzanowskim suggested there's some thread unsafety from running the LocalProcess writes on the concurrent DispatchQueue.global queue. Unfortunately, swapping this out for a serial queue (see @krzyzanowskim's patch here) didn't remedy the error entirely, and the minimal example above still suffers the same issues even with the serial queue

sullivan-sean avatar Nov 16 '21 00:11 sullivan-sean

Thank you for this great reproduction case.

migueldeicaza avatar Nov 20 '21 16:11 migueldeicaza

Hello,

It looks like the problem is that DispatchIO.read and write are convenience functions that mess with the descriptor settings behind our back, and the interleaved operations might be trampling over these. It looks like these will not play well with bidirectional sockets:

After calling this method, the system takes control of the specified file descriptor until the handler block is enqueued. While it controls the file descriptor, the system may modify it on behalf of the application. For example, the system typically adds the O_NONBLOCK flag to ensure that any operations are non-blocking. During that time, it is an error for your application to modify the file descriptor directly. However, you may pass the file descriptor to this method or the read(fd:length:queue:handler:) function. You may also use the file descriptor to create a new dispatch I/O channel. The system relinquishes control of the file descriptor before calling your handler, so it is safe to modify the file descriptor again from your handler code.

I have a new variation of the LocalProcess that creates dedicated channels, and it seems to address these problems. I also use explicit queues, can you try it?

https://gist.github.com/migueldeicaza/73e36c2c98b0427f3197622bdcb3a801

@krzyzanowskim

migueldeicaza avatar Nov 20 '21 20:11 migueldeicaza

Hello,

It looks like the problem is that DispatchIO.read and write are convenience functions that mess with the descriptor settings behind our back, and the interleaved operations might be trampling over these. It looks like these will not play well with bidirectional sockets:

After calling this method, the system takes control of the specified file descriptor until the handler block is enqueued. While it controls the file descriptor, the system may modify it on behalf of the application. For example, the system typically adds the O_NONBLOCK flag to ensure that any operations are non-blocking. During that time, it is an error for your application to modify the file descriptor directly. However, you may pass the file descriptor to this method or the read(fd:length:queue:handler:) function. You may also use the file descriptor to create a new dispatch I/O channel. The system relinquishes control of the file descriptor before calling your handler, so it is safe to modify the file descriptor again from your handler code.

I have a new variation of the LocalProcess that creates dedicated channels, and it seems to address these problems. I also use explicit queues, can you try it?

https://gist.github.com/migueldeicaza/73e36c2c98b0427f3197622bdcb3a801

@krzyzanowskim

You may want to use a lock GCD group around the ptr in this code, I think you may race the read/write buffer as-is.

JoeMatt avatar Nov 21 '21 02:11 JoeMatt

Thanks! This gist seems to fix the problem of hanging writes šŸŽ‰

Successive calls to LocalProcess send are still executed asynchronously however, which for large numbers of writes can result in interspersed writes so instead of writing echo hello > /tmp/150.txt\n and then echo hello > /tmp/151.txt\n we sometimes end up with something like echo hello > /tmp/150.txtecho\nhello > /tmp/151.txt\n I imagine in most cases it is desirable for writes to complete before writing more data.

This behavior using your gist is available on this branch of the example repo: https://github.com/withfig/swiftterm-write-example/tree/channeled-local-process

Because of the queues used, the print statements for send completion do not show up but I imagine if they did they would not always appear in the same order in which the bytes to send were queued.

sullivan-sean avatar Nov 22 '21 18:11 sullivan-sean

Very good observation on the race, it took me a while to see it, but it is right there in front of my eyes!

This is the current patch I have, I am in the process of downloading a working Xcode to test:

https://gist.github.com/migueldeicaza/0bb69eacc777a626dc0ba1a23d2e5421

migueldeicaza avatar Dec 17 '21 02:12 migueldeicaza

Ok that patch is not qutie ready: the DispatchWrite does the write, but never invokes the callback.

That's because I did not understand the purpose of the callback.

migueldeicaza avatar Dec 17 '21 03:12 migueldeicaza

Ok, couple of updates.

  1. The multi-queue approach for some reason causes control-c on the MacTerminal to not be delivered immediately under load (for example, running ls -lR /, and this causes the control-c to take a long time before it responds. Can take minutes to do so.

https://gist.github.com/migueldeicaza/4c8fb6497a3a06d10739f0401de1d6f9

This patch is for swifterm

  1. I tried going without the multi-queue approach, and just using a DispatchQueue just for reading and instead handling writers directly over the socket. This poses the problem that somehow the child process in the testcase hangs at around 8192 bytes of output produced - even if I have read all the data, so something is putting data that I am not reading out.

https://gist.github.com/migueldeicaza/773f6c79906de6dd9472cbc9b281719f

this patch is against the test case, to easily try out different versions of LocalProcess.

migueldeicaza avatar Dec 18 '21 03:12 migueldeicaza

I am strangely invested in this ticket. Iā€™d be happy to debug tomorrow when I get a chance. šŸ˜‚

Not kidding though, I had a full day remote coding interview and during my lunch break I started reading through the gist for the bug. I think I spotted a couple actually. Will get back.

JoeMatt avatar Dec 18 '21 09:12 JoeMatt

There is now a bug fix on main

migueldeicaza avatar May 23 '23 10:05 migueldeicaza