awsome-distributed-training
FSDP Training Job failing on Validation Step (Batch 500)
I'm running the script 3.test_cases/10.FSDP/1.distributed-training.sbatch on 2 P5 nodes, and the job fails at the validation step after 500 batches with the following error:
0: OSError: [Errno 12] Cannot allocate memory
Configuration: SageMaker HyperPod - 2x P5 nodes
Ubuntu 20.04 DLAMI, NCCL version 2.19.4+cuda12.1
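
For anyone triaging this: Errno 12 is ENOMEM, which is raised by the operating system, so it likely points at host (CPU) memory rather than GPU memory. The following is a minimal sketch, not part of the test case, for logging host memory on each node just before validation starts. The helper names `parse_meminfo` and `host_memory_summary` are hypothetical, and it assumes a Linux host where /proc/meminfo is readable.

```python
import errno
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value (kiB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   <number> [kB]" lines.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def host_memory_summary(path="/proc/meminfo"):
    """Return a one-line summary of total and available host memory (Linux only)."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    kib_per_gib = 1024 ** 2
    return "MemTotal={:.1f}GiB MemAvailable={:.1f}GiB".format(
        info.get("MemTotal", 0) / kib_per_gib,
        info.get("MemAvailable", 0) / kib_per_gib,
    )


if __name__ == "__main__":
    # Confirm what errno 12 means on this platform, then report host memory.
    print("errno 12:", errno.errorcode.get(12), "-", os.strerror(12))
    print(host_memory_summary())
```

Calling `host_memory_summary()` from the training loop right before the validation step (an assumption about where the failure occurs, based on the report above) would show whether available host memory is close to zero when the error is raised.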