
[Tacotron2/PyTorch] cuda_rng_state error?

Open BodaSadalla98 opened this issue 2 years ago • 0 comments

Related to Model/Framework(s) PyTorch Distributed Training

Describe the bug Error:

torch.cuda.set_rng_state(checkpoint['cuda_rng_state_all'][device_id])
IndexError: index 2 is out of bounds for dimension 0 with size 2

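This error occurs because the checkpoint dictionary stores one cuda_rng_state per GPU used at save time.

When a checkpoint is loaded, each process restores the saved state indexed by its local rank. If training resumes with more GPUs than were used when the checkpoint was saved, the higher ranks have no saved entry and the index goes out of bounds.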

To Reproduce Steps to reproduce the behavior:
1- Train on multiple GPUs or multiple nodes (2 GPUs, for example).
2- Save a checkpoint.
3- Continue training from the saved checkpoint with more GPUs (4 GPUs, for example).

Suggested Solution Check whether the saved checkpoint has enough states for the current training configuration, and load the state only if it does.
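A minimal sketch of that check, assuming the checkpoint layout from the traceback above (`checkpoint['cuda_rng_state_all']` indexed by `device_id`). The helper names `select_rng_state` and `restore_cuda_rng_state` are hypothetical and not part of the repository; only `torch.cuda.set_rng_state` is the real PyTorch call.

```python
def select_rng_state(saved_states, device_id):
    """Return the saved RNG state for device_id, or None if none was saved.

    Hypothetical helper: saved_states is assumed to be the per-GPU collection
    stored under checkpoint['cuda_rng_state_all'].
    """
    if 0 <= device_id < len(saved_states):
        return saved_states[device_id]
    return None


def restore_cuda_rng_state(checkpoint, device_id):
    """Restore this device's CUDA RNG state only if the checkpoint has one.

    Returns True if a state was restored, False if it was skipped.
    """
    # Imported lazily so the selection helper can be used without PyTorch.
    import torch

    state = select_rng_state(checkpoint['cuda_rng_state_all'], device_id)
    if state is None:
        # The checkpoint was saved with fewer GPUs than the current run;
        # keep the freshly initialised RNG state for this device.
        print(f"No saved CUDA RNG state for device {device_id}; skipping restore.")
        return False
    torch.cuda.set_rng_state(state)
    return True
```

With this guard, ranks that have no saved entry start from their current RNG state instead of raising IndexError, so the resumed run may not be exactly reproducible on those ranks.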

BodaSadalla98, Mar 14 '22 09:03