Remote runner blocks sometimes on Mac and Windows
Executing this job using the remote runner and one worker blocks almost every time on Mac and Windows. Everything works fine on Linux.
factory := GtRrExampleTestFactory new.
factory addExampleClasses: {
Dictionary.
GtInspectorVariableValuePairsExamples .
ByteArray }.
job := GtRemoteRunner default submitJob: factory job.
GtInspectorVariableValuePairsExamples has examples that return several large arrays.
Alternatively, the following class can be used to reproduce the problem:
Object subclass: #TestBlockingRunner
instanceVariableNames: ''
classVariableNames: ''
package: 'Haba'.
#TestBlockingRunner asClass
compile: 'testArrayPairsOverLimit
<gtExample>
| limit pairs |
limit := 2 * 100000 + 1.
pairs := (1 to: limit) asArray collect: [ :e | e -> e ].
^ pairs asOrderedCollection'.
#TestBlockingRunner asClass
compile: 'testArrayPairsUnderLimit
<gtExample>
| limit pairs |
limit := 2 * 5000 - 1.
pairs := (1 to: limit) asArray collect: [ :e | e -> e ].
^ pairs asOrderedCollection'
The initial bug found was an error where the socket would be incorrectly marked "OtherEndClosed" if reading from the socket would be a blocking operation. This is fixed by https://github.com/pharo-project/opensmalltalk-vm/commit/826736844cbc4f9b09cc205db23c53a1adef41ee in the Pharo 9.0.13 VM.
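To illustrate the distinction behind that first bug, here is a minimal sketch of how a non-blocking read result can be classified. It is not the VM's code: `conn_state`, `classify_read` and `classify_read_demo` are hypothetical names, and it assumes POSIX-style semantics where a stream-socket read returning 0 means the peer closed the connection, while -1 with `EAGAIN`/`EWOULDBLOCK` only means that no data is available yet.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical connection states; CONN_OTHER_END_CLOSED loosely mirrors
 * the "OtherEndClosed" state mentioned in the issue. */
typedef enum { CONN_CONNECTED, CONN_OTHER_END_CLOSED, CONN_ERROR } conn_state;

/* Classify the result of a non-blocking read on a stream socket.
 * n   : return value of the read call
 * err : errno value captured right after the call (only used when n < 0) */
conn_state classify_read(long n, int err)
{
    if (n > 0)
        return CONN_CONNECTED;          /* data arrived */
    if (n == 0)
        return CONN_OTHER_END_CLOSED;   /* orderly shutdown by the peer */
    if (err == EAGAIN || err == EWOULDBLOCK || err == EINTR)
        return CONN_CONNECTED;          /* would block: still connected, retry later */
    return CONN_ERROR;                  /* a real failure, e.g. ECONNRESET */
}

/* Small usage check of the sketch. */
void classify_read_demo(void)
{
    assert(classify_read(-1, EAGAIN) == CONN_CONNECTED);
    assert(classify_read(0, 0) == CONN_OTHER_END_CLOSED);
}
```

The point of the sketch is only that "the read would block" and "the other end closed" are different outcomes and must not be mapped to the same state.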
However, that VM makes other changes to AIO (https://github.com/pharo-project/opensmalltalk-vm/blob/pharo-9/extracted/vm/src/win/aioWin.c) that conflict with glutin's event polling, so sockets never receive events. A workaround was added in https://github.com/akgrant43/opensmalltalk-vm/commit/1200a143d153559e1f5d6bb65a574cbfe74bd590 that got sockets basically working again (although with issues, as described below).
While sockets were basically working, socket I/O on Windows was extremely slow, up to 100x slower, if there was no other I/O. This is because the sockets were only polled if the poll timed out, not if it was woken by some other VM operation. https://github.com/akgrant43/opensmalltalk-vm/commit/43a448a1a8a5358d21d047659d0486c2477b0214 resolves this and makes Windows socket performance similar to Mac and Linux (slightly less CPU efficient due to the excess polling).
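The difference between the two polling policies can be modelled with a small sketch. This is a simplified model, not the code from the commit: `wake_reason`, `poll_only_on_timeout`, `poll_on_every_wake`, `count_socket_polls` and `polling_demo` are hypothetical names, and the model only counts how often sockets would be serviced for a given sequence of wake-ups.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reasons for the event wait returning. */
typedef enum { WAKE_TIMEOUT, WAKE_OTHER_VM_OPERATION } wake_reason;

/* Slow behaviour described above: sockets are serviced only after a timeout. */
bool poll_only_on_timeout(wake_reason why)
{
    return why == WAKE_TIMEOUT;
}

/* Behaviour after the fix, as described: sockets are serviced on every wake-up. */
bool poll_on_every_wake(wake_reason why)
{
    (void)why;  /* the reason no longer matters */
    return true;
}

/* Count how many times sockets would be serviced for a sequence of wake-ups. */
int count_socket_polls(const wake_reason *wakes, int n, bool (*policy)(wake_reason))
{
    int polls = 0;
    for (int i = 0; i < n; i++) {
        if (policy(wakes[i]))
            polls++;
    }
    return polls;
}

/* Small usage check of the sketch. */
void polling_demo(void)
{
    const wake_reason wakes[] = {
        WAKE_OTHER_VM_OPERATION, WAKE_OTHER_VM_OPERATION,
        WAKE_TIMEOUT, WAKE_OTHER_VM_OPERATION
    };
    assert(count_socket_polls(wakes, 4, poll_only_on_timeout) == 1);
    assert(count_socket_polls(wakes, 4, poll_on_every_wake) == 4);
}
```

Servicing sockets on every wake-up trades some extra polling (and CPU) for lower latency, which matches the efficiency note in the paragraph above.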
It was then discovered that sockets on Mac would still hang when large buffers were being transferred. In this scenario it appears that the flags passed to dataHandler() show that the write semaphore should be signalled (AIO_W is set), but not the read semaphore (AIO_R is clear). However, this appears to result in subsequent polls never setting AIO_R. Signalling the read semaphore whenever AIO_W is set avoids this issue; see https://github.com/akgrant43/opensmalltalk-vm/commit/dc888ece23098ddb44d88033c06498aa91b9ff99.
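The workaround can be sketched as a mapping from the handler's flags to the semaphores that get signalled. This is an illustration under assumptions, not the code from the commit: the flag names `AIO_R` and `AIO_W` come from the description above, but the bit values used here, the `sem_signals` struct, and the functions `signals_for_flags` and `signals_demo` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names as used in the issue; the bit values are illustrative assumptions. */
#define AIO_R (1 << 1)
#define AIO_W (1 << 2)

/* Hypothetical result type: which semaphores the handler would signal. */
typedef struct {
    bool signal_read;
    bool signal_write;
} sem_signals;

/* Decide which semaphores to signal for a given set of flags.
 * Workaround behaviour: the read semaphore is also signalled whenever
 * AIO_W is set, even if AIO_R is clear. */
sem_signals signals_for_flags(int flags)
{
    sem_signals s;
    s.signal_write = (flags & AIO_W) != 0;
    s.signal_read  = (flags & (AIO_R | AIO_W)) != 0;
    return s;
}

/* Small usage check of the sketch. */
void signals_demo(void)
{
    sem_signals s = signals_for_flags(AIO_W);   /* the problematic case on Mac */
    assert(s.signal_write);
    assert(s.signal_read);                      /* reader is woken as well */
}
```

A reader woken this way may find no data and simply wait again, so the cost is an occasional spurious wake-up rather than a missed one.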
However, Gt on Windows is still hanging under some (as yet unknown) circumstances. Sometimes it doesn't respond to socket or mouse I/O, and sometimes the process hangs completely (the OS shows it as 'Not Responding'). There are other changes in the OpenSmalltalk VM that aren't in the Pharo VM and that need to be investigated. The two VMs have diverged over the last 3 years, so it isn't a straightforward merge.
Related to https://github.com/pharo-project/pharo/issues/11083
I believe this can be closed now, right?
I believe the socket issue still exists on Windows. We have a workaround for RemoteRunner. I'm not sure of the socket status on Mac. @chisandrei, do you know?
I think lots of things have changed in this area since the issue was opened. We should open specific issues if problems of this nature still exist.
Ok, this is a tracking issue until the Pharo one is solved.