OSSubprocess
OSSubprocess copied to clipboard
Message #waitForExit doesn’t always return when the process has terminated
We’re having a problem on our Jenkins server: test runs often hang. It seems to be related to OSSubprocess. I tried to capture it in a small example:
https://github.com/Rinzwind/OSSubprocessTest
The example:
- Sets up a Seaside request handler at ‘/test’ (see:
TestRequestHandler>>#handleFiltered:) - Patches OSSVMProcess to log sends of
#checkFinishedChildren(see:BaselineOfTest>>#postLoad) - Runs a test that uses ‘curl’ to request ‘/test’, repeated 10000 times while another ‘bash’ process executes ‘ps’ every 5 seconds (see:
TestTestCase>>#test)
Below is a sample from the output we get on our Jenkins server. It shows that the test hangs:
- The line
Did #run (m: 2542, pid: 6480)shows that a ‘curl’ process was spawned with PID 6480 - The lines
6480 Z 00:00:00 \_ curl <defunct>from the ‘ps’ command show that that process has terminated but that its status has not been consumed yet (it’s a zombie process) - There is no corresponding line
Did #waitForExit (m: 2542)showing a return from#waitForExit - The only remaining output is from the ‘ps’ command
Note there’s also no more line Will #checkFinishedChildren after the last Did #run. As far as I understand, the child watcher in OSSVMProcess should be sending #checkFinishedChildren at least once a second:
https://github.com/pharo-contributions/OSSubprocess/blob/67ded100f5e1fc0ca56434f857c9388b4c94457d/repository/OSSubprocess/OSSVMProcess.class.st#L234-L244
So far I’ve only been able to reproduce this on our Jenkins server though, making it a little hard to debug.
Sample output:
[…]
+ /home/packer/smalltalkCI-master/bin/smalltalkci --smalltalk Pharo64-9.0 .smalltalk.ston
[…]
Downloading the latest 90 Image:
http://files.pharo.org/get-files/90/pharoImage-x86_64.zip
[…]
Downloading the latest pharoVM:
http://files.pharo.org/get-files/90/pharo-vm-Linux-x86_64-stable.zip
[…]
Build information: Pharo-9.0.0+build.1575.sha.9bb5f998e8a6d016ec7abde3ed09c4a60c0b4551 (64 Bit)
[…]
Running suite "Pharo64-9.0" with 1 tests.
[…]
Did #run (m: 2541, pid: 6478)
HTTP/1.1 200 OK
Content-Length: 5
Content-Type: text/plain
Server: Zinc HTTP Components 1.0 (Pharo/9.0)
Date: Wed, 03 Aug 2022 19:16:13 GMT
Test
Will #checkFinishedChildren
Did #checkFinishedChildren
Did #waitForExit (m: 2541)
Did #run (m: 2542, pid: 6480)
HTTP/1.1 200 OK
Content-Length: 5
Content-Type: text/plain
Server: Zinc HTTP Components 1.0 (Pharo/9.0)
Date: Wed, 03 Aug 2022 19:16:13 GMT
Test
Did sleep (n: 26)
PID STAT TIME COMMAND
1230 Sl 00:00:10 pharo
1245 S 00:00:00 \_ bash
6480 Z 00:00:00 \_ curl <defunct>
Did sleep (n: 27)
PID STAT TIME COMMAND
1230 Sl 00:00:10 pharo
1245 S 00:00:00 \_ bash
6480 Z 00:00:00 \_ curl <defunct>
[…]
Here’s sample output from a run using Pharo 11 in which the problem also occurred: the ‘curl’ process with PID 13026 has terminated, but the output is not showing a corresponding Did #waitForExit (m: 5718).
[…]
+ /home/packer/smalltalkCI-master/bin/smalltalkci --smalltalk Pharo64-11 .smalltalk.ston
[…]
Downloading the latest 110 Image:
http://files.pharo.org/get-files/110/pharoImage-x86_64.zip
[…]
Downloading the latest pharoVM:
http://files.pharo.org/get-files/110/pharo-vm-Linux-x86_64-stable.zip
[…]
Build information: Pharo-11.0.0+build.142.sha.78baf8659decaa50e73226a5eddcd410cebfa4f7 (64 Bit)
[…]
Running suite "Pharo64-11" with 1 tests.
[…]
Did #run (m: 5717, pid: 13024)
HTTP/1.1 200 OK
Server: Zinc HTTP Components 1.0 (Pharo/11.0)
Date: Wed, 03 Aug 2022 22:35:33 GMT
Content-Type: text/plain
Content-Length: 5
Test
Will #checkFinishedChildren
Did #checkFinishedChildren
Did #waitForExit (m: 5717)
Did #run (m: 5718, pid: 13026)
HTTP/1.1 200 OK
Server: Zinc HTTP Components 1.0 (Pharo/11.0)
Date: Wed, 03 Aug 2022 22:35:33 GMT
Content-Type: text/plain
Content-Length: 5
Test
Did sleep (n: 56)
PID STAT TIME COMMAND
1221 Sl 00:00:20 pharo
1236 S 00:00:00 \_ bash
13026 Z 00:00:00 \_ curl <defunct>
Did sleep (n: 57)
PID STAT TIME COMMAND
1221 Sl 00:00:20 pharo
1236 S 00:00:00 \_ bash
13026 Z 00:00:00 \_ curl <defunct>
[…]
After adding this to also log the sends of #waitTimeoutMSecs: in OSSVMProcess>>#initializeChildWatcher:
OSSVMProcess compile: ((OSSVMProcess sourceCodeAt: #initializeChildWatcher)
copyReplaceAll: 'self sigChldSemaphore waitTimeoutMSecs: 1000.'
with: '
Stdio stdout nextPutAll: ''Will #waitTimeoutMSecs:''; lf; flush.
self sigChldSemaphore waitTimeoutMSecs: 1000.
Stdio stdout nextPutAll: ''Did #waitTimeoutMSecs:''; lf; flush.').
We get output on our Jenkins server as shown below. After the last Will #waitTimeoutMSecs: there is no Did #waitTimeoutMSecs:, so it seems there is no return from either the send of #sigChldSemaphore or the send of #waitTimeoutMSecs: to the Semaphore here:
https://github.com/pharo-contributions/OSSubprocess/blob/67ded100f5e1fc0ca56434f857c9388b4c94457d/repository/OSSubprocess/OSSVMProcess.class.st#L241
Sample output:
[…]
+ /home/packer/smalltalkCI-master/bin/smalltalkci --smalltalk Pharo64-11 .smalltalk.ston
[…]
Downloading the latest 110 Image:
http://files.pharo.org/get-files/110/pharoImage-x86_64.zip
[…]
Downloading the latest pharoVM:
http://files.pharo.org/get-files/110/pharo-vm-Linux-x86_64-stable.zip
[…]
Build information: Pharo-11.0.0+build.145.sha.1fc9017e1fe9a0d3c3b7580e2131ecfd9a27e428 (64 Bit)
[…]
Running suite "Pharo64-11" with 1 tests.
[…]
Did #run (m: 2767, pid: 6945)
HTTP/1.1 200 OK
Server: Zinc HTTP Components 1.0 (Pharo/11.0)
Date: Thu, 11 Aug 2022 13:35:28 GMT
Content-Type: text/plain
Content-Length: 5
Test
Did #waitTimeoutMSecs:
Will #checkFinishedChildren
Did #checkFinishedChildren
Will #waitTimeoutMSecs:
Did #waitForExit (m: 2767)
Did #run (m: 2768, pid: 6947)
Did sleep (n: 27)
HTTP/1.1 200 OK
Server: Zinc HTTP Components 1.0 (Pharo/11.0)
Date: Thu, 11 Aug 2022 13:35:28 GMT
Content-Type: text/plain
Content-Length: 5
Test
PID STAT TIME COMMAND
1218 Sl 00:00:11 pharo
1233 S 00:00:00 \_ bash
6947 Z 00:00:00 \_ curl <defunct>
Did sleep (n: 28)
PID STAT TIME COMMAND
1218 Sl 00:00:11 pharo
1233 S 00:00:00 \_ bash
6947 Z 00:00:00 \_ curl <defunct>
[…]