Caleb Spradlin

2 comments from Caleb Spradlin

We are seeing this as well. A tricky one for sure. The training phase runs without problems; the issue only appears when saving the checkpoint. Machine(s): 5x nodes: AMD Rome,...

> > Figured this out. This is because NCCL cannot use the memory in the PyTorch memory pool, and a CUDA OOM occurs during the NCCL collective operation. Set `NCCL_DEBUG=INFO` to...
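
A minimal sketch of the kind of workaround the quoted comment points at, assuming a standard `torch.distributed` NCCL setup. Only `NCCL_DEBUG=INFO` comes from the thread; the `empty_cache()` call before checkpointing, the barriers, and the `save_checkpoint` helper are illustrative assumptions, not the exact fix described there:

```python
import os
import torch
import torch.distributed as dist

# Turn on NCCL logging so an OOM inside a collective (rather than inside
# PyTorch's own allocator) shows up clearly in the logs. The variable must be
# set before the process group / first NCCL call is initialized; exporting it
# in the launch script is equivalent.
os.environ["NCCL_DEBUG"] = "INFO"

def save_checkpoint(model, path, rank):
    # Release cached blocks held by the PyTorch caching allocator so NCCL's
    # own CUDA buffers have headroom during the surrounding collectives.
    torch.cuda.empty_cache()

    # Ensure all ranks have finished outstanding GPU work before rank 0
    # serializes the weights.
    dist.barrier()
    if rank == 0:
        torch.save(model.state_dict(), path)
    dist.barrier()
```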