
Training hangs after one of the master/worker pods restarts

Open · dmitsf opened this issue 4 years ago · 5 comments

Hello! I'm setting up training with PyTorchJobs and have run into a problem: if one of the pods restarts (master or worker, it doesn't matter), the whole training process hangs. The restart can happen for different reasons; usually it's due to Google Compute Engine node rescheduling. I also tried killing pods myself, and the behavior was the same. Can I avoid this behavior and make the training tolerant to pod restarts?

dmitsf · Oct 28 '21
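
For context, a sketch of the usual non-elastic setup helps explain the hang. The snippet below is an illustration modeled on the MNIST example, not dmitsf's actual script; the gloo backend and the toy model are assumptions. The operator injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into every pod, and the world size is fixed at init time, so once a peer pod restarts the surviving ranks block in the next collective while the restarted pod blocks in init_process_group.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# pytorch-operator injects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
# into each pod, so env:// initialization picks them up automatically.
dist.init_process_group(backend="gloo", init_method="env://")

model = DDP(nn.Linear(10, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1000):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(torch.randn(32, 10)),
                                  torch.randn(32, 1))
    # backward() runs a gradient all-reduce that every rank must join.
    # When a pod is rescheduled, the surviving ranks block here forever,
    # and the restarted pod blocks in init_process_group, because the
    # world size is fixed and there is no re-rendezvous.
    loss.backward()
    optimizer.step()
```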

Can you tell us the PyTorch version?

gaocegege · Oct 29 '21

I use PyTorch 1.9.0.

dmitsf · Oct 29 '21

Are you using torch.distributed.run?

gaocegege · Oct 29 '21
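
For readers following along: torch.distributed.run is the elastic launcher added in PyTorch 1.9 (the successor to torch.distributed.launch). It re-rendezvouses workers and restarts them on membership changes, which is the standard way to make a job survive a pod restart. A minimal sketch of a script adapted for it follows; the launcher flags in the comment are real torch.distributed.run options, while the endpoint pytorch-master:29400 and the checkpoint path are hypothetical placeholders.

```python
# Launched with the elastic runner instead of a fixed-world launcher, e.g.:
#
#   python -m torch.distributed.run \
#       --nnodes=1:4 \
#       --nproc_per_node=1 \
#       --rdzv_backend=c10d \
#       --rdzv_endpoint=pytorch-master:29400 \
#       --max_restarts=3 \
#       train.py
#
# torch.distributed.run sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
# itself and restarts *all* workers whenever membership changes, so the
# script must checkpoint and resume rather than assume it runs only once.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo", init_method="env://")

CKPT = "/mnt/shared/ckpt.pt"  # hypothetical path on shared storage
net = nn.Linear(10, 1)
start_step = 0
if os.path.exists(CKPT):
    state = torch.load(CKPT, map_location="cpu")
    net.load_state_dict(state["model"])
    start_step = state["step"] + 1

model = DDP(net)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
if start_step > 0:
    optimizer.load_state_dict(state["optimizer"])

for step in range(start_step, 1000):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(torch.randn(32, 10)),
                                  torch.randn(32, 1))
    loss.backward()
    optimizer.step()
    # Periodic checkpoints let a restarted job pick up where it left off.
    if step % 100 == 0 and dist.get_rank() == 0:
        torch.save({"model": net.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT)
```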

I'm not using it at the moment. I followed the MNIST example to adapt my training script.

dmitsf · Oct 29 '21

Can you please show us the script and the YAML file? PyTorch 1.9 introduced elastic training, and that may be related to the hang.

gaocegege · Oct 30 '21
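
As a stopgap until an elastic setup is in place, init_process_group accepts a timeout, which at least turns the indefinite hang into an error that the operator's restart policy can react to. A minimal sketch, assuming the gloo backend; with NCCL on PyTorch 1.9, the timeout is honored only when NCCL_BLOCKING_WAIT=1 or NCCL_ASYNC_ERROR_HANDLING=1 is set in the pod environment.

```python
import datetime

import torch.distributed as dist

# With gloo, a collective that waits longer than `timeout` raises a
# RuntimeError instead of blocking forever, so the pod exits and the
# operator's restart policy can relaunch it (resuming from a checkpoint).
dist.init_process_group(
    backend="gloo",
    init_method="env://",
    timeout=datetime.timedelta(minutes=10),
)
```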