OSSubprocess icon indicating copy to clipboard operation
OSSubprocess copied to clipboard

Message #waitForExit doesn’t always return when the process has terminated

Open Rinzwind opened this issue 3 years ago • 2 comments

We’re having a problem on our Jenkins server: test runs often hang. It seems to be related to OSSubprocess. I tried to capture it in a small example:

https://github.com/Rinzwind/OSSubprocessTest

The example:

Below is a sample from the output we get on our Jenkins server. It shows that the test hangs:

  • The line Did #run (m: 2542, pid: 6480) shows that a ‘curl’ process was spawned with PID 6480
  • The lines 6480 Z 00:00:00 \_ curl <defunct> from the ‘ps’ command show that that process has terminated but that its status has not been consumed yet (it’s a zombie process)
  • There is no corresponding line Did #waitForExit (m: 2542) showing a return from #waitForExit
  • The only remaining output is from the ‘ps’ command

Note there’s also no more line Will #checkFinishedChildren after the last Did #run. As far as I understand, the child watcher in OSSVMProcess should be sending #checkFinishedChildren at least once a second:

https://github.com/pharo-contributions/OSSubprocess/blob/67ded100f5e1fc0ca56434f857c9388b4c94457d/repository/OSSubprocess/OSSVMProcess.class.st#L234-L244

So far I’ve only been able to reproduce this on our Jenkins server though, making it a little hard to debug.

Sample output:

[…]
+ /home/packer/smalltalkCI-master/bin/smalltalkci --smalltalk Pharo64-9.0 .smalltalk.ston
[…]
Downloading the latest 90 Image:
    http://files.pharo.org/get-files/90/pharoImage-x86_64.zip
[…]
Downloading the latest pharoVM:
	http://files.pharo.org/get-files/90/pharo-vm-Linux-x86_64-stable.zip
[…]
Build information: Pharo-9.0.0+build.1575.sha.9bb5f998e8a6d016ec7abde3ed09c4a60c0b4551 (64 Bit)
[…]
Running suite "Pharo64-9.0" with 1 tests.
[…]
Did #run (m: 2541, pid: 6478)
HTTP/1.1 200 OK
Content-Length: 5
Content-Type: text/plain
Server: Zinc HTTP Components 1.0 (Pharo/9.0)
Date: Wed, 03 Aug 2022 19:16:13 GMT

Test
Will #checkFinishedChildren
Did #checkFinishedChildren
Did #waitForExit (m: 2541)
Did #run (m: 2542, pid: 6480)
HTTP/1.1 200 OK
Content-Length: 5
Content-Type: text/plain
Server: Zinc HTTP Components 1.0 (Pharo/9.0)
Date: Wed, 03 Aug 2022 19:16:13 GMT

Test
Did sleep (n: 26)
    PID STAT     TIME COMMAND
   1230 Sl   00:00:10 pharo
   1245 S    00:00:00  \_ bash
   6480 Z    00:00:00  \_ curl <defunct>
Did sleep (n: 27)
    PID STAT     TIME COMMAND
   1230 Sl   00:00:10 pharo
   1245 S    00:00:00  \_ bash
   6480 Z    00:00:00  \_ curl <defunct>
[…]

Rinzwind avatar Aug 03 '22 19:08 Rinzwind

Here’s sample output from a run using Pharo 11 in which the problem also occurred: the ‘curl’ process with PID 13026 has terminated, but the output is not showing a corresponding Did #waitForExit (m: 5718).

[…]
+ /home/packer/smalltalkCI-master/bin/smalltalkci --smalltalk Pharo64-11 .smalltalk.ston
[…]
Downloading the latest 110 Image:
    http://files.pharo.org/get-files/110/pharoImage-x86_64.zip
[…]
Downloading the latest pharoVM:
	http://files.pharo.org/get-files/110/pharo-vm-Linux-x86_64-stable.zip
[…]
Build information: Pharo-11.0.0+build.142.sha.78baf8659decaa50e73226a5eddcd410cebfa4f7 (64 Bit)
[…]
Running suite "Pharo64-11" with 1 tests.
[…]
Did #run (m: 5717, pid: 13024)
HTTP/1.1 200 OK
Server: Zinc HTTP Components 1.0 (Pharo/11.0)
Date: Wed, 03 Aug 2022 22:35:33 GMT
Content-Type: text/plain
Content-Length: 5

Test
Will #checkFinishedChildren
Did #checkFinishedChildren
Did #waitForExit (m: 5717)
Did #run (m: 5718, pid: 13026)
HTTP/1.1 200 OK
Server: Zinc HTTP Components 1.0 (Pharo/11.0)
Date: Wed, 03 Aug 2022 22:35:33 GMT
Content-Type: text/plain
Content-Length: 5

Test
Did sleep (n: 56)
    PID STAT     TIME COMMAND
   1221 Sl   00:00:20 pharo
   1236 S    00:00:00  \_ bash
  13026 Z    00:00:00  \_ curl <defunct>
Did sleep (n: 57)
    PID STAT     TIME COMMAND
   1221 Sl   00:00:20 pharo
   1236 S    00:00:00  \_ bash
  13026 Z    00:00:00  \_ curl <defunct>
[…]

Rinzwind avatar Aug 03 '22 23:08 Rinzwind

After adding this to also log the sends of #waitTimeoutMSecs: in OSSVMProcess>>#initializeChildWatcher:

OSSVMProcess compile: ((OSSVMProcess sourceCodeAt: #initializeChildWatcher)
	copyReplaceAll: 'self sigChldSemaphore waitTimeoutMSecs: 1000.'
	with: '
		Stdio stdout nextPutAll: ''Will #waitTimeoutMSecs:''; lf; flush.
		self sigChldSemaphore waitTimeoutMSecs: 1000.
		Stdio stdout nextPutAll: ''Did #waitTimeoutMSecs:''; lf; flush.').

We get output on our Jenkins server as shown below. After the last Will #waitTimeoutMSecs: there is no Did #waitTimeoutMSecs:, so it seems there is no return from either the send of #sigChldSemaphore or the send of #waitTimeoutMSecs: to the Semaphore here:

https://github.com/pharo-contributions/OSSubprocess/blob/67ded100f5e1fc0ca56434f857c9388b4c94457d/repository/OSSubprocess/OSSVMProcess.class.st#L241

Sample output:

[…]
+ /home/packer/smalltalkCI-master/bin/smalltalkci --smalltalk Pharo64-11 .smalltalk.ston
[…]
Downloading the latest 110 Image:
    http://files.pharo.org/get-files/110/pharoImage-x86_64.zip
[…]
Downloading the latest pharoVM:
	http://files.pharo.org/get-files/110/pharo-vm-Linux-x86_64-stable.zip
[…]
Build information: Pharo-11.0.0+build.145.sha.1fc9017e1fe9a0d3c3b7580e2131ecfd9a27e428 (64 Bit)
[…]
Running suite "Pharo64-11" with 1 tests.
[…]
Did #run (m: 2767, pid: 6945)
HTTP/1.1 200 OK
Server: Zinc HTTP Components 1.0 (Pharo/11.0)
Date: Thu, 11 Aug 2022 13:35:28 GMT
Content-Type: text/plain
Content-Length: 5

Test
Did #waitTimeoutMSecs:
Will #checkFinishedChildren
Did #checkFinishedChildren
Will #waitTimeoutMSecs:
Did #waitForExit (m: 2767)
Did #run (m: 2768, pid: 6947)
Did sleep (n: 27)
HTTP/1.1 200 OK
Server: Zinc HTTP Components 1.0 (Pharo/11.0)
Date: Thu, 11 Aug 2022 13:35:28 GMT
Content-Type: text/plain
Content-Length: 5

Test
    PID STAT     TIME COMMAND
   1218 Sl   00:00:11 pharo
   1233 S    00:00:00  \_ bash
   6947 Z    00:00:00  \_ curl <defunct>
Did sleep (n: 28)
    PID STAT     TIME COMMAND
   1218 Sl   00:00:11 pharo
   1233 S    00:00:00  \_ bash
   6947 Z    00:00:00  \_ curl <defunct>
[…]

Rinzwind avatar Aug 11 '22 13:08 Rinzwind