PyTorch job appears to succeed but output data is incomplete

Open WencongXiao opened this issue 3 years ago • 1 comment

Dear all,

When using kubedl to manage distributed PyTorch jobs, I found that the output data of a job is sometimes not fully saved to the storage system, yet the job still reports itself as successful.

To understand why, I checked the detailed logs of each worker. They show that once the master role of the PyTorch job completes, the whole job is terminated, regardless of whether the pods of the other workers are still running. I think this is the reason for the incomplete output data.

This looks like a bug to me. Could you please take a look?

Thanks,
Wencong

WencongXiao, Mar 15 '21 14:03

Hi @WencongXiao, the issue you've mentioned will be fixed soon. We'll introduce successPolicy to PyTorchJob to address this, so that a PyTorchJob is marked as Succeeded only when all workers exit successfully.

SimonCqk, Apr 26 '21 08:04
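
The difference between the two behaviors discussed in this thread can be illustrated with a minimal Python sketch. This is not kubedl's code or API: the `Phase` enum, the `(role, index)` pod keys, and both function names are hypothetical, chosen only to contrast "succeed when the master exits" with "succeed when every replica exits".

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical pod phases, for illustration only.
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


def succeeded_when_master_exits(pods):
    """Behavior reported in the issue: only the master's phase is consulted."""
    masters = [phase for (role, _), phase in pods.items() if role == "master"]
    return bool(masters) and all(phase is Phase.SUCCEEDED for phase in masters)


def succeeded_when_all_exit(pods):
    """Behavior described in the reply: every replica must exit successfully."""
    return bool(pods) and all(phase is Phase.SUCCEEDED for phase in pods.values())


# One worker is still running (for example, still writing output)
# when the master has already finished.
pods = {
    ("master", 0): Phase.SUCCEEDED,
    ("worker", 0): Phase.SUCCEEDED,
    ("worker", 1): Phase.RUNNING,
}

print(succeeded_when_master_exits(pods))  # master-only check
print(succeeded_when_all_exit(pods))      # all-replicas check
```

With the master-only check, the job is declared successful while `worker 1` is still running, which matches the incomplete output described above. The all-replicas check holds off until every pod has exited successfully.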