Boda Sadallah
Results: 2 issues of Boda Sadallah
Related to **Model/Framework(s)** *PyTorch Distributed Training* **Describe the bug** Error:
```
torch.cuda.set_rng_state(checkpoint['cuda_rng_state_all'][device_id])
IndexError: index 2 is out of bounds for dimension 0 with size 2
```
This error happens as...
bug
Related to **Model/Framework(s)** PyTorch Distributed Training **Describe the bug** The bug occurs in multi-node training because the training script uses `local_rank` to save checkpoints, so it repeats for...
bug
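The second report is cut off, so the following is only one reading of it: a minimal sketch, in plain Python, of why naming checkpoints by `local_rank` can collide in multi-node training. It assumes a shared filesystem, 2 nodes with 2 GPUs each, and a hypothetical `checkpoint_rank{N}.pt` naming scheme; `global_rank` and `checkpoint_name` are illustrative helpers, not functions from the training script.

```python
from itertools import product


def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Rank that is unique across the whole job, not just within one node."""
    return node_rank * gpus_per_node + local_rank


def checkpoint_name(rank: int) -> str:
    # Hypothetical naming scheme, used only for this illustration.
    return f"checkpoint_rank{rank}.pt"


# Assumed topology: 2 nodes, 2 GPUs per node.
nodes, gpus_per_node = 2, 2
workers = list(product(range(nodes), range(gpus_per_node)))

# local_rank restarts at 0 on every node, so file names repeat across nodes.
by_local = [checkpoint_name(local) for _, local in workers]

# The global rank is distinct for every worker, so file names stay unique.
by_global = [
    checkpoint_name(global_rank(node, local, gpus_per_node))
    for node, local in workers
]

print(len(set(by_local)))   # 2 distinct names for 4 workers
print(len(set(by_global)))  # 4 distinct names for 4 workers
```

Under these assumptions, four workers write to only two distinct file names when keyed by local rank, while keying by global rank gives each worker its own file.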