NeMo
FileNotFoundError during checkpoint saving in nemo_model_checkpoint.py
When training a speech-to-text model using NeMo and PyTorch Lightning, the training crashes during the validation phase due to a FileNotFoundError while attempting to remove an older .nemo checkpoint file.
Environment:
- NeMo version: 2.2.0
- Python version: 3.10
Reproduction Steps:
- Train a model using the speech_to_text_ctc_bpe.py script.
- Set up model checkpointing using nemo_model_checkpoint.py (default or modified).
- After validation runs, the script crashes during checkpoint saving.
Traceback excerpt (from nemo/utils/callbacks):
    File "nemo_model_checkpoint.py", line 246, in on_save_checkpoint
        get_filesystem(backup_path).rm(backup_path)
@titu1994
I have just updated it. There seems to be a race condition in how DDP and global ranks interact: it works as expected on a single GPU, but crashes on a cluster. Should I raise a PR?
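For illustration, here is a minimal sketch of the kind of guard I have in mind. It is not the actual NeMo code: `remove_backup_if_present` and its `is_global_zero` parameter are hypothetical names, and it uses plain `os.remove` on a local path instead of the `get_filesystem(...).rm(...)` call from the traceback. The idea is that only one rank attempts the delete, and a file that has already disappeared is treated as success rather than an error.

```python
import os
import tempfile


def remove_backup_if_present(backup_path: str, is_global_zero: bool = True) -> bool:
    """Delete backup_path if it exists; return True only if this call removed it.

    Hypothetical helper, not part of NeMo. `is_global_zero` stands in for the
    trainer's rank check so that only one process attempts the removal.
    """
    if not is_global_zero:
        # Other ranks skip the delete entirely.
        return False
    try:
        os.remove(backup_path)
    except FileNotFoundError:
        # Another process (or an earlier call) already removed the file.
        return False
    return True


if __name__ == "__main__":
    # Small demonstration with a temporary file standing in for a .nemo backup.
    fd, path = tempfile.mkstemp(suffix=".nemo")
    os.close(fd)
    print(remove_backup_if_present(path))  # first call removes the file
    print(remove_backup_if_present(path))  # second call finds nothing to remove
```

Catching `FileNotFoundError` instead of checking `os.path.exists` first avoids a check-then-delete window in which another rank could remove the file between the two steps.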