Make checkpoint more robust

Open gshennvm opened this issue 1 year ago • 1 comments

Currently, when saving a NeMo checkpoint, it creates a directory and then saves.

This is not robust, because the training job can get preempted in the middle of a save, leaving a corrupted checkpoint.

Ideally, NeMo would create a temporary file, save to it, and rename it to the actual checkpoint only once saving is done.
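
To make the request concrete, here is a minimal sketch of the write-then-rename pattern in plain Python. It is not NeMo's implementation; `save_checkpoint_atomically` and `write_fn` are hypothetical names used only for illustration, and the sketch assumes a single writer process and a POSIX-like filesystem where renaming within one directory is atomic.

```python
import os
import shutil
import tempfile


def save_checkpoint_atomically(final_dir, write_fn):
    """Write a checkpoint into a temporary directory, then rename it into place.

    `write_fn` is a hypothetical callback that takes a directory path and
    writes all checkpoint files into it.
    """
    final_dir = os.path.abspath(final_dir)
    parent = os.path.dirname(final_dir)
    os.makedirs(parent, exist_ok=True)

    # Create the temporary directory next to the destination so the final
    # rename stays on the same filesystem.
    tmp_dir = tempfile.mkdtemp(
        prefix=os.path.basename(final_dir) + ".tmp-", dir=parent
    )
    try:
        write_fn(tmp_dir)
        # Publish the checkpoint only after every file has been written.
        # os.rename raises OSError if final_dir already exists and is non-empty.
        os.rename(tmp_dir, final_dir)
    except BaseException:
        # Clean up the partial save on any in-process failure.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    return final_dir
```

If the job is killed outright during `write_fn`, the cleanup branch never runs, but the partial data stays under a name containing `.tmp-`, so loading code can ignore or delete such directories instead of treating them as checkpoints.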

gshennvm avatar Feb 01 '24 00:02 gshennvm

Hi @gshennvm, a fix for incomplete checkpoints was recently merged into main [PR7952]. Could you verify whether it resolves the problem you noticed?

jbieniusiewi avatar Feb 09 '24 11:02 jbieniusiewi

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Mar 11 '24 01:03 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Mar 19 '24 01:03 github-actions[bot]