nanoGPT
CUDA out of memory
I tried running on a cluster using multiple nodes. Example:
- run "torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<Master node's IP>--master_port=1234 train.py --dataset=shakespeare --dtype=float16 --batch_size=2 --compile=False" on the Master node
- ssh to the second node
- run "torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=<Master node's IP> --master_port=1234 train.py --dataset=shakespeare --dtype=float16 --batch_size=2 --compile=False"
Got errors: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 394.00 MiB (GPU 0; 15.78 GiB total capacity; 5.17 GiB already allocated; 8.68 GiB free; 5.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
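The error message itself points at one mitigation: when reserved memory is much larger than allocated memory, fragmentation may be the issue, and max_split_size_mb can help. A minimal sketch of applying it from Python is below (the 128 MiB value is an arbitrary starting point, not a tuned recommendation); exporting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 in the shell before running torchrun has the same effect.
import os
# Must be set before PyTorch's CUDA caching allocator initializes,
# i.e. before the first CUDA allocation; setting it before importing torch is safest.
# 128 is an arbitrary starting value, not a tuned recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch  # imported after the env var is set so the allocator picks it up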
Try this: I only have a MacBook
Hi, I'm getting the same CUDA out of memory error. I am running train.py
on a single-GPU node with 16 GB of GPU memory. I first ran it with a batch_size of 32:
python train.py --batch_size=32 --compile=False
I saw this error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.14 GiB (GPU 0; 15.78 GiB total capacity; 10.16 GiB already allocated; 4.63 GiB free; 10.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I reduced the batch_size to 12, and it got to the first training step, but then I saw a similar error:
python train.py --batch_size=12 --compile=False
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 15.78 GiB total capacity; 14.54 GiB already allocated; 123.69 MiB free; 14.70 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Did you figure out how to resolve this?
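One generic way to keep training when a single step does not fit is gradient accumulation: run several small micro-batches per optimizer step, so per-step activation memory stays small while the effective batch size stays larger. The sketch below shows the technique with a toy model and placeholder shapes; it is not a claim about how nanoGPT's train.py is structured.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-in for the real model and data; the point is the accumulation pattern.
model = nn.Linear(64, 64).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

accum_steps = 4  # e.g. 4 micro-batches of 8 behave like one batch of 32
optimizer.zero_grad(set_to_none=True)
for micro_step in range(accum_steps):
    x = torch.randn(8, 64, device=device)  # small micro-batch
    y = torch.randn(8, 64, device=device)
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so the accumulated gradient matches a full batch
optimizer.step()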
Follow-up: I upgraded torch to 2.0 and reduced the batch size further to 6. I can now run train.py on my dataset.
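For anyone landing here, the working invocation implied by that change is presumably something like python train.py --batch_size=6 --compile=False (same flags as before, smaller batch). To sanity-check how much headroom a given batch size leaves, a minimal sketch using torch.cuda.mem_get_info (a standard PyTorch call that reports free and total device memory in bytes):
import torch

# Report free vs. total memory on the current CUDA device (bytes -> GiB).
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"free: {free / 2**30:.2f} GiB / total: {total / 2**30:.2f} GiB")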