
FileNotFoundError during checkpoint saving in nemo_model_checkpoint.py

Open mujhenahiata opened this issue 5 months ago • 0 comments

When training a speech-to-text model using NeMo and PyTorch Lightning, the training crashes during the validation phase due to a FileNotFoundError while attempting to remove an older .nemo checkpoint file.

Environment:

  • NeMo version: 2.2.0
  • Python version: 3.10

Reproduction Steps:

  • Train a model using the speech_to_text_ctc_bpe.py script.
  • Set up model checkpointing using nemo_model_checkpoint.py (default or modified).
  • After validation runs, during checkpoint saving, the script crashes.

Traceback excerpt (nemo/utils/callbacks):

  File "nemo_model_checkpoint.py", line 246, in on_save_checkpoint
    get_filesystem(backup_path).rm(backup_path)

@titu1994

I have just updated it. There seems to be a race condition between DDP global ranks: it works as expected on a single GPU but crashes on a cluster. Should I raise a PR?
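For illustration, here is a minimal sketch of the kind of guard that would avoid this crash, assuming the cause is several ranks trying to delete the same backup file. The helper name `remove_backup_if_present` and its `is_global_zero` parameter are hypothetical and not part of NeMo, and it uses `os.remove` on a local path instead of the `get_filesystem(...).rm(...)` call from the traceback, so it is not the actual fix:

```python
import os
import tempfile


def remove_backup_if_present(backup_path: str, is_global_zero: bool = True) -> bool:
    """Hypothetical helper (not a NeMo API).

    Deletes backup_path only on the global-zero rank and treats an
    already-missing file as a no-op. Returns True if this call removed it.
    """
    if not is_global_zero:
        # Assumption: non-zero ranks should leave cleanup to rank 0.
        return False
    try:
        os.remove(backup_path)
        return True
    except FileNotFoundError:
        # Another process already removed the file; nothing left to do.
        return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.nemo")
        open(path, "w").close()
        # The second call simulates a late rank hitting the same path.
        print(remove_backup_if_present(path))
        print(remove_backup_if_present(path))
```

The rank check and the `FileNotFoundError` handling are independent: either one alone would stop the crash in this toy setup, and whether both are appropriate in the real callback depends on how the ranks share the checkpoint directory.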

mujhenahiata, May 14 '25 15:05