nanoGPT
CUDA out of memory
Trying to train on a cluster using multiple nodes. Example:
- run "torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<Master node's IP> --master_port=1234 train.py --dataset=shakespeare --dtype=float16 --batch_size=2 --compile=False" on the master node (note the space between --master_addr and --master_port)
- ssh to the second node
- run "torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=<Master node's IP> --master_port=1234 train.py --dataset=shakespeare --dtype=float16 --batch_size=2 --compile=False" on the second node
Got this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 394.00 MiB (GPU 0; 15.78 GiB total capacity; 5.17 GiB already allocated; 8.68 GiB free; 5.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
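One thing worth trying, per the hint in the error message itself, is capping the allocator's split size via the PYTORCH_CUDA_ALLOC_CONF environment variable to reduce fragmentation. This is a sketch, not a guaranteed fix; the value 128 (MiB) is an assumed starting point and may need tuning for your workload.

```shell
# Set before launching torchrun on EACH node (the allocator reads this
# at process start). 128 MiB is an assumption; try other values if OOM persists.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```

Then relaunch the same torchrun commands on both nodes. If the OOM still occurs, lowering --batch_size further (or using gradient accumulation) is the more direct lever, since fragmentation tuning only helps when reserved memory greatly exceeds allocated memory.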