Caleb Spradlin

2 comments from Caleb Spradlin

We are seeing this as well. A tricky one for sure. The training phase runs without problems; the issue only appears when saving the checkpoint. Machine(s): 5x nodes: AMD Rome,...

> > Figured this out. This is because NCCL cannot use the memory in the PyTorch memory pool, and a CUDA OOM occurs during the NCCL collective operation. Set `NCCL_DEBUG=INFO` to...
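
A minimal sketch of the kind of workaround the quoted comment points at, assuming a standard `torch.distributed` NCCL setup. Only `NCCL_DEBUG=INFO` comes from the thread; the `empty_cache()` call before checkpointing, the barriers, and the `save_checkpoint` helper are illustrative assumptions, not the exact fix described there:

```python
import os
import torch
import torch.distributed as dist

# Turn on NCCL logging so an OOM inside a collective (rather than inside
# PyTorch's own allocator) shows up clearly in the logs. The variable must be
# set before the process group / first NCCL call is initialized; exporting it
# in the launch script is equivalent.
os.environ["NCCL_DEBUG"] = "INFO"

def save_checkpoint(model, path, rank):
    # Release cached blocks held by the PyTorch caching allocator so NCCL's
    # own CUDA buffers have headroom during the surrounding collectives.
    torch.cuda.empty_cache()

    # Ensure all ranks have finished outstanding GPU work before rank 0
    # serializes the weights.
    dist.barrier()
    if rank == 0:
        torch.save(model.state_dict(), path)
    dist.barrier()
```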