Worker killed error

Open gaow opened this issue 5 years ago • 3 comments

I got an error message that comes from this line of code:

https://github.com/vatlab/sos/blob/cf3717e280d01a3ce5f6650f275aa1f91832d97a/src/sos/workers.py#L689

I wonder whether this is a critical or a benign error. I am thinking a warning might do: if the worker is killed and the task did not produce its output, some other error will be triggered anyway, for example for missing output files.

INFO: Waiting for the completion of 50 tasks before submitting 554 pending ones.
WARNING: Task M200_d439b16a19c9fd37 considered as aborted due to missing pulse file.
INFO: M200_14c70cf6212440ef submitted to midway2 with job id 63661901
INFO: M200_b4cc366abb70d754 submitted to midway2 with job id 63661904
INFO: M200_ed7d96c64609238e submitted to midway2 with job id 63661905
INFO: M200_46faec8ac706332e submitted to midway2 with job id 63661906
ERROR: One of the local workers has been killed.

gaow avatar Nov 10 '19 01:11 gaow

This means one of the workers was killed by an external force, and SoS currently exits without trying to recover from the error. Do you have any idea why it happened (e.g. the process was killed because it ran out of memory)?
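As a rough aid for diagnosing this kind of failure, the sketch below interprets a child's exit code the way Python's `multiprocessing.Process.exitcode` reports it: a negative value `-N` means the child was terminated by signal `N`. The helper name `describe_exit` is hypothetical and is not part of SoS. On Linux, an out-of-memory kill usually shows up as `SIGKILL`, though that signal alone does not prove memory was the cause.

```python
import signal


def describe_exit(exitcode):
    """Hypothetical helper (not SoS code): explain a Process.exitcode value."""
    if exitcode is None:
        # multiprocessing reports None while the child is still alive
        return "still running"
    if exitcode == 0:
        return "exited normally"
    if exitcode < 0:
        # A negative exit code -N means termination by signal N
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            # Signal number unknown on this platform
            name = f"signal {-exitcode}"
        return f"killed by {name}"
    return f"exited with status {exitcode}"
```

Calling `describe_exit(proc.exitcode)` on a finished `multiprocessing.Process` would then give a one-line hint about whether the worker exited on its own or was killed from outside.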

BoPeng avatar Nov 10 '19 01:11 BoPeng

Yes, I figured it is the manager thread that failed. It is a run from the head node. I have no idea why it fails though, because I have been running similar pipelines on the head node and it has never complained before. Do you think it is still a good idea to raise an error here, or are there any attempts we could make to recover?

gaow avatar Nov 10 '19 02:11 gaow

Recovering from a failed worker is very difficult because the master has to know who is doing what, use a ping-pong protocol to check the status of the workers, and then restart a worker and re-send the job if one died. A robust workflow system should be able to do this, but it is too difficult (time-consuming) for me to do at this point.
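To make the bookkeeping concrete, here is a minimal sketch of the master-side part of such a ping-pong scheme. It is an illustration only and not how SoS works: the class `HeartbeatMonitor`, its methods, and the timeout value are all hypothetical, and the sketch leaves out the hard parts (actually sending pings, restarting processes, and handling jobs that were half finished).

```python
import time


class HeartbeatMonitor:
    """Hypothetical master-side tracker: who runs what, and who stopped replying."""

    def __init__(self, timeout=5.0, clock=time.monotonic):
        self.timeout = timeout    # seconds without a pong before a worker is presumed dead
        self.clock = clock        # injectable so the logic can be tested without sleeping
        self.last_pong = {}       # worker id -> time of the most recent pong
        self.jobs = {}            # worker id -> job currently assigned (or None)
        self.pending = []         # jobs waiting to be (re)sent to a worker

    def register(self, worker_id):
        self.last_pong[worker_id] = self.clock()
        self.jobs[worker_id] = None

    def assign(self, worker_id, job):
        self.jobs[worker_id] = job

    def pong(self, worker_id):
        # Called whenever a worker answers a ping
        self.last_pong[worker_id] = self.clock()

    def reap(self):
        """Drop workers that missed the timeout and requeue their jobs."""
        now = self.clock()
        dead = [w for w, t in self.last_pong.items() if now - t > self.timeout]
        for w in dead:
            del self.last_pong[w]
            job = self.jobs.pop(w)
            if job is not None:
                self.pending.append(job)  # to be re-sent to a replacement worker
        return dead
```

In a real system the master would call `reap()` periodically, start a replacement process for each dead worker, and hand it a job from `pending`. Re-sending is only safe if jobs can be rerun without side effects, which is part of why this is hard to retrofit.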

BoPeng avatar Nov 10 '19 04:11 BoPeng