awsome-distributed-training

FSDP Training Job failing on Validation Step (Batch 500)

Open: nghtm opened this issue 9 months ago • 1 comment

I am running the script 3.test_cases/10.FSDP/1.distributed-training.sbatch on 2 P5 nodes, and the job fails at the validation step after 500 batches.

slurm-47.log

0: OSError: [Errno 12] Cannot allocate memory

Configuration: SageMaker HyperPod - 2x P5 nodes

Ubuntu 20.04 DLAMI, NCCL version 2.19.4+cuda12.1
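Errno 12 is ENOMEM, an error raised by the operating system on the host rather than a CUDA out-of-memory error, so a snapshot of host memory taken just before the validation step may help narrow this down. Below is a minimal sketch of such a check. It assumes a Linux node where /proc/meminfo is readable; the helper names parse_meminfo and host_memory_report are hypothetical and are not part of the training script, and the log does not confirm that host memory pressure is the actual cause.

```python
import errno
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Field: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def host_memory_report(path="/proc/meminfo"):
    """Return a one-line host memory summary, or a note if unavailable."""
    if not os.path.exists(path):
        return "meminfo not available on this platform"
    with open(path) as f:
        info = parse_meminfo(f.read())
    # Fields chosen for illustration; values converted from kB to GiB.
    fields = ("MemTotal", "MemAvailable", "Shmem", "Committed_AS")
    return ", ".join(f"{k}={info.get(k, 0) / 1024**2:.1f}GiB" for k in fields)


if __name__ == "__main__":
    # errno 12 maps to the symbolic name ENOMEM
    print(errno.errorcode[12], "|", host_memory_report())
```

Calling host_memory_report() on each node right before validation starts, and again after the failure, would show whether available memory or shared memory usage changes sharply around batch 500.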

nghtm, May 19 '24 23:05