Workflow workers do not shut down cleanly after Ctrl-C
Workflow
input: for_each=dict(i=range(5)), concurrent=True
python: expand='${ }'
    import time
    import os
    for rep in range(100):
        print(f'I am {rep} of task ${i} from {os.getpid()}')
        time.sleep(1)
Press Ctrl-C after the workflow has been running for a while with, say, 5 workers.
Tracing a bit shows that
- The subprocesses receive Ctrl-C but ignore it. Each worker waits for instructions from the master.
- The master process receives Ctrl-C and calls kill_all, which sends None to the workers.
- The workers respond and start quitting.
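The handshake traced above can be sketched roughly as follows. This is only an illustration, not the actual sos code: `worker_loop` is a hypothetical name, `kill_all` here is a stand-in for the master's function of the same name, and plain `queue.Queue` objects are assumed in place of the real sockets.

```python
import queue
import signal
import threading


def worker_loop(tasks, results):
    """Hypothetical worker body: ignore Ctrl-C, stop only when None arrives."""
    if threading.current_thread() is threading.main_thread():
        # Signal handlers can only be installed from the main thread.
        signal.signal(signal.SIGINT, signal.SIG_IGN)
    while True:
        task = tasks.get()
        if task is None:            # sentinel sent by the master
            break
        results.put(f"done {task}")
    results.put(None)               # tell the master this worker has quit


def kill_all(task_queues):
    """Stand-in for the master's kill_all: send None to every worker."""
    for tasks in task_queues:
        tasks.put(None)
```

The function is meant to be the body of a worker process, so it leaves SIGINT ignored for the lifetime of that process. The point of the design is that a worker never reacts to the signal itself; it only stops when the master's None reaches it.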
All but the last worker exited successfully. The last one hangs while disconnecting a socket, so it seems that its socket communication with the master controller did not complete.
Using
export SOS_DEBUG=CONTROLLER,-
will show that one of the processes tries to terminate without closing one last socket. I still have to figure out which socket is left open.
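One way to find the leftover socket, independent of the debug output, is to keep a small registry of named sockets and ask it what is still open at shutdown. The sketch below is an assumption-laden illustration: `SocketRegistry` and the socket names are hypothetical, and standard-library sockets stand in for whatever socket type the controller actually uses.

```python
import socket


class SocketRegistry:
    """Hypothetical helper that remembers which named sockets are still open."""

    def __init__(self):
        self._sockets = {}

    def register(self, name, sock):
        # Remember the socket under a human-readable name.
        self._sockets[name] = sock
        return sock

    def close(self, name):
        # Close the socket and forget it.
        self._sockets.pop(name).close()

    def still_open(self):
        # fileno() returns -1 once a standard-library socket has been closed.
        return sorted(n for n, s in self._sockets.items() if s.fileno() != -1)
```

Calling `still_open()` just before a process terminates would list the names of any sockets that were never closed.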
When Ctrl-C arrives while the workers are busy, each worker ignores the signal, and the controller sends None to the workers to tell them to stop.
The problem is that when a worker is busy with a step, the step is not killed properly. In particular, the step executor has a result-collector socket that is not closed automatically, so the worker refuses to stop.
This is not easy to fix because the socket is a communication channel between the substep executor and the step executor: the workers do not know how to close the step executor's socket, nor how to send a signal from the substep executor side to close it. It could be a low-priority bug for now.
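For the simpler case where the code that owns the socket is itself the one being interrupted, a context manager can guarantee the socket is closed. This is a minimal sketch under that assumption only and does not solve the cross-executor problem described above: `result_collector` is a hypothetical name, and a standard-library socket pair stands in for the real result-collector channel.

```python
import socket
from contextlib import contextmanager


@contextmanager
def result_collector():
    """Hypothetical stand-in for the step executor's result-collector socket."""
    receiver, sender = socket.socketpair()
    try:
        yield receiver, sender
    finally:
        # Runs on normal exit and also when the step is interrupted,
        # so neither end of the channel is left open.
        receiver.close()
        sender.close()
```

Any exception raised inside the `with result_collector() as (receiver, sender):` block, including `KeyboardInterrupt`, still passes through the `finally` clause before propagating.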