Workflow workers do not shut down cleanly after Ctrl-C

Open BoPeng opened this issue 5 years ago • 2 comments

Workflow

input: for_each=dict(i=range(5)), concurrent=True

python: expand='${ }'
  import time
  import os
  for rep in range(100):
      print(f'I am {rep} of task ${i} from {os.getpid()}')
      time.sleep(1)

Press Ctrl-C after the workflow has been running for a while with, say, 5 workers.

Tracing a bit shows that

  1. The subprocesses receive Ctrl-C but ignore it. Each worker keeps waiting for instructions from the master.
  2. The master process receives Ctrl-C and calls kill_all, which sends None to the workers.
  3. The workers respond and start quitting.

All but the last worker exited successfully. The last one stops at the socket disconnect step, so it seems that its socket communication with the master controller is unsuccessful.
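The three steps traced above can be sketched roughly as follows. This is only an illustration, not SoS code: the worker script, `start_worker`, and `demo` are hypothetical, `kill_all` borrows only its name from the trace, and a stdin pipe stands in for the sockets SoS actually uses. It assumes a POSIX system, where SIGINT can be sent to a child process.

```python
import signal
import subprocess
import sys

# Hypothetical worker: ignores SIGINT and stops only when the master sends "None".
WORKER_SRC = r"""
import signal, sys
signal.signal(signal.SIGINT, signal.SIG_IGN)  # step 1: Ctrl-C is ignored
print("ready", flush=True)
for line in sys.stdin:                         # wait for master instructions
    if line.strip() == "None":                 # step 3: sentinel received, quit
        break
print("stopped", flush=True)
"""


def start_worker():
    """Start one worker and wait until it has installed its signal handler."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER_SRC],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    assert proc.stdout.readline().strip() == "ready"
    return proc


def kill_all(workers):
    """Step 2 (stand-in): send the None sentinel to every worker, then collect output."""
    for w in workers:
        w.stdin.write("None\n")
        w.stdin.flush()
    return [w.communicate(timeout=10)[0].strip() for w in workers]


def demo(n=3):
    workers = [start_worker() for _ in range(n)]
    for w in workers:
        w.send_signal(signal.SIGINT)  # simulate Ctrl-C reaching each worker
    outputs = kill_all(workers)
    return outputs, [w.returncode for w in workers]
```

In this sketch every worker survives the SIGINT and exits with status 0 only after the sentinel arrives, which is the clean path. The bug described here is that one worker does not complete that last stage.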

May 30 '20 01:05 BoPeng

Using

export SOS_DEBUG=CONTROLLER,-

will show that one of the processes tries to terminate without closing one last socket. I will have to figure out which socket is still open.

May 30 '20 01:05 BoPeng

When Ctrl-C happens while the workers are busy, the workers ignore the signal, but the controller sends None to the workers to make them stop.

The problem is that when the worker is busy with a step, the step is not killed properly. In particular, the step executor has a result-collector socket that is not closed automatically, so the worker refuses to stop.

This is not easy to fix because the socket is a communication channel between the substep executor and the step executor, so the workers do not know how to close the step executor's socket, nor how to send a signal from the substep executor side to close it. It could be a low priority bug for now.
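One possible direction, sketched here purely as an assumption and not as anything SoS implements, is to record every socket in a shared registry so that a shutdown path can see which ones are still open and close the leftovers. `SocketRegistry`, the socket names, and `demo` are all hypothetical, and a plain `socket.socketpair()` stands in for the real step/substep channel.

```python
import socket


class SocketRegistry:
    """Hypothetical registry of open sockets, keyed by a descriptive name."""

    def __init__(self):
        self._open = {}

    def register(self, name, sock):
        self._open[name] = sock
        return sock

    def close(self, name):
        # Close and forget one socket; unknown names are ignored.
        sock = self._open.pop(name, None)
        if sock is not None:
            sock.close()

    def open_names(self):
        # Names of sockets that nobody has closed yet.
        return sorted(self._open)

    def close_all(self):
        # Shutdown path: close whatever is left and report what it was.
        closed = self.open_names()
        for name in closed:
            self.close(name)
        return closed


def demo():
    registry = SocketRegistry()
    collector, substep_end = socket.socketpair()
    registry.register("result_collector", collector)
    registry.register("substep", substep_end)

    registry.close("substep")          # the substep side closes its own end
    leftover = registry.open_names()   # the collector is still open
    closed = registry.close_all()      # the worker closes it on shutdown
    return leftover, closed, collector.fileno()
```

With something like this, the worker would not need to know how the step executor created its result collector. It would only need access to the registry when it receives the stop instruction, and `open_names()` would also answer the earlier question of which socket is still open.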

Jun 02 '20 04:06 BoPeng