Make checkpoint more robust
Currently, when saving a NeMo checkpoint, it creates the checkpoint directory first and then saves into it.
This is not robust: if the training job gets preempted in the middle of a save, it leaves behind a corrupted (partially written) checkpoint.
Ideally, NeMo would create a temporary file, save to it, and only rename it to the actual checkpoint name once saving has finished.
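To illustrate the idea, here is a minimal sketch of the write-to-temp-then-rename pattern in plain Python. This is not NeMo's API or its actual implementation: `atomic_save` and `write_fn` are hypothetical names, and the sketch assumes the temporary file and the final checkpoint live on the same filesystem, since that is what makes the rename atomic.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def atomic_save(final_path: str, write_fn: Callable[[str], None]) -> None:
    """Hypothetical helper: write to a temp file, then rename it into place.

    write_fn is any callable that writes the checkpoint to the path it is given.
    """
    final = Path(final_path)
    final.parent.mkdir(parents=True, exist_ok=True)

    # Create the temp file next to the target so both are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=final.parent, prefix=final.name + ".", suffix=".tmp"
    )
    os.close(fd)

    try:
        write_fn(tmp_name)
        # Ask the OS to flush the written data to disk before the rename.
        with open(tmp_name, "r+b") as f:
            os.fsync(f.fileno())
        # os.replace swaps the file into place in one step, so the final
        # path only ever holds a fully written file.
        os.replace(tmp_name, final)
    except BaseException:
        # Best-effort cleanup if the save fails or is interrupted.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

With this pattern, a job that is killed mid-save may leave a stray `.tmp` file behind (the cleanup branch does not run on a hard kill), but the final checkpoint name never points at a partial file, so resume logic can simply ignore or delete anything ending in `.tmp`.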
Hi @gshennvm, a fix for incomplete checkpoints was recently merged into main [PR7952]. Could you verify whether it resolves the problem you noticed?
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.