kubedl
PyTorch job reports success but output data is incomplete
Dear all,
While using kubedl to manage distributed PyTorch jobs, I found that the output data of a job is sometimes not fully saved to the storage system, yet the job still reports itself as successful.
To understand why, I checked the detailed logs of each worker. They show that once the master replica of the PyTorch job completes, the whole job is terminated, regardless of whether the other worker pods are still running. I think this is the reason for the incomplete output data.
This looks like a bug to me. Could you please take a look?
Thanks, Wencong
Hi @WencongXiao, the issue you've mentioned will be fixed soon. We'll introduce successPolicy to PyTorchJob to address this, so that a PyTorchJob will be marked as Succeeded only when all workers exit successfully.
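
To make the difference concrete, here is a minimal Python sketch of the two behaviours discussed in this thread. It is not kubedl code: `Phase`, `job_succeeded`, and the `all_workers` flag are hypothetical names chosen for illustration, and the real successPolicy field and its values may look different.

```python
from enum import Enum


class Phase(Enum):
    """Hypothetical replica phases, for illustration only."""
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


def job_succeeded(master, workers, all_workers=False):
    """Decide whether a job counts as succeeded.

    all_workers=False models the behaviour reported in this issue:
    the job succeeds as soon as the master replica succeeds.
    all_workers=True models the proposed policy: the master and
    every worker must have exited successfully.
    """
    if master is not Phase.SUCCEEDED:
        return False
    if not all_workers:
        return True
    return all(w is Phase.SUCCEEDED for w in workers)


# The master has finished, but one worker is still writing its output.
workers = [Phase.SUCCEEDED, Phase.RUNNING]
print(job_succeeded(Phase.SUCCEEDED, workers))                    # reported behaviour
print(job_succeeded(Phase.SUCCEEDED, workers, all_workers=True))  # proposed policy
```

Under the first call the job would be considered finished while a worker is still running, which matches the incomplete output described above. Under the second it would stay unfinished until that worker exits successfully.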