
[Tacotron2/Pytorch] Multi-node error on saving checkpoints?

Open · BodaSadalla98 opened this issue 2 years ago · 0 comments

Related to Model/Framework(s): PyTorch Distributed Training

Describe the bug
With multi-node training, the training script uses local_rank to decide when to save checkpoints, so the save is repeated on every node. This sometimes produces an error when more than one node tries to write the checkpoint files at the same time, or tries to create the symlink to the last checkpoint.

ERROR:

File "train.py", line 229, in save_checkpoint
    print("Updating symlink", symlink_dst, "to point to", symlink_src)
FileExistsError: [Errno 17] File exists: 'checkpoint_Tacotron2_0.pt' -> 'output/checkpoint_Tacotron2_last.pt' 
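
To make the failure mode concrete, here is a minimal sketch of the pattern described above; the function body and names are hypothetical, not the actual Tacotron2 train.py code. A local_rank == 0 guard is true for one process on every node, so with N nodes, N processes race on the same paths on shared storage, and os.symlink raises FileExistsError when the link already exists:

```python
import os
import torch

def save_checkpoint(model, epoch, output_dir, local_rank):
    # Hypothetical sketch: local_rank == 0 holds for one process on *every*
    # node, so with N nodes, N processes execute this block concurrently.
    if local_rank == 0:
        path = os.path.join(output_dir, f"checkpoint_Tacotron2_{epoch}.pt")
        torch.save({"epoch": epoch, "state_dict": model.state_dict()}, path)

        # The symlink update is where the FileExistsError above surfaces:
        # another node may have already created the link.
        symlink = os.path.join(output_dir, "checkpoint_Tacotron2_last.pt")
        os.symlink(os.path.basename(path), symlink)
```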

To Reproduce
Steps to reproduce the behavior:

  1. Train on a multi-node cluster.

Expected behavior
I think a good fix is to save checkpoints based on the global rank, so the checkpoint is written only once per job.
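
As a minimal sketch of that suggestion, assuming torch.distributed has already been initialized by the launcher, the guard could use the global rank instead of local_rank (the helper name and argument layout here are hypothetical):

```python
import os
import torch
import torch.distributed as dist

def save_checkpoint_rank0(model, epoch, output_dir):
    # Determine the *global* rank; fall back to 0 for single-process runs.
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0

    if rank == 0:
        path = os.path.join(output_dir, f"checkpoint_Tacotron2_{epoch}.pt")
        torch.save({"epoch": epoch, "state_dict": model.state_dict()}, path)

        # Refresh the "last" symlink without racing other nodes: only one
        # process in the whole job reaches this point now.
        symlink = os.path.join(output_dir, "checkpoint_Tacotron2_last.pt")
        if os.path.lexists(symlink):
            os.remove(symlink)
        os.symlink(os.path.basename(path), symlink)

    # Keep the other ranks from running ahead while rank 0 is still writing.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```

Called from every rank, this writes the checkpoint exactly once per job, and the barrier keeps the other ranks in sync until rank 0 has finished writing.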

BodaSadalla98 · Mar 14 '22 09:03