
CUDA out of memory

hanfluid opened this issue 2 years ago · 3 comments

Trying to run train.py on a cluster using multiple nodes. Example:

  1. run "torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<Master node's IP> --master_port=1234 train.py --dataset=shakespeare --dtype=float16 --batch_size=2 --compile=False" on the Master node
  2. ssh to the second node
  3. run "torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=<Master node's IP> --master_port=1234 train.py --dataset=shakespeare --dtype=float16 --batch_size=2 --compile=False"

Got errors: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 394.00 MiB (GPU 0; 15.78 GiB total capacity; 5.17 GiB already allocated; 8.68 GiB free; 5.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
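
With the stock train.py defaults (GPT-2 124M, block_size=1024), 16 GiB per GPU is already tight even at batch_size=2, and an allocation failing with 8.68 GiB reported free may point at fragmentation, which is what the max_split_size_mb hint in the message is about. As a sketch (not something suggested in this thread), the usual memory knobs can also be collected in a config file; the setting names below are real train.py globals, but the file name and the values are only illustrative guesses:

    # Hypothetical low-memory config, e.g. config/train_shakespeare_lowmem.py,
    # passed as:  python train.py config/train_shakespeare_lowmem.py
    dataset = 'shakespeare'
    dtype = 'float16'
    compile = False

    batch_size = 2     # sequences per micro-step
    block_size = 256   # context length; attention memory grows with its square (default is 1024)

    # shrink the model itself if the 124M-parameter default still does not fit
    n_layer = 6
    n_head = 6
    n_embd = 384

Exporting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 before launching torchrun, as the error message itself suggests, may also be worth trying.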

hanfluid avatar Jan 25 '23 02:01 hanfluid

Try this: I only have a MacBook.

davidbullado avatar Jan 31 '23 12:01 davidbullado

Hi, I'm getting the same CUDA out of memory error. I am running train.py on a single-GPU node with 16 GB of GPU memory. I first ran it with a batch_size of 32:

python train.py --batch_size=32 --compile=False

I saw this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.14 GiB (GPU 0; 15.78 GiB total capacity; 10.16 GiB already allocated; 4.63 GiB free; 10.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
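
As an aside (not something the posters ran), the numbers in that message can be cross-checked from Python with standard torch.cuda calls, which also shows whether another process on the node is holding part of the 16 GiB:

    import torch

    # Snapshot of device 0. mem_get_info reports what the driver sees,
    # including memory held by other processes on the same GPU.
    free, total = torch.cuda.mem_get_info()
    print(f"free {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")

    # This process's view: 'allocated' counts live tensors, 'reserved' also
    # includes blocks cached by the allocator but not returned to the driver.
    print(f"allocated {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
    print(f"reserved  {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
    # torch.cuda.memory_summary() prints the full allocator breakdown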

I reduced the batch_size to 12 and it got to the first training step, but then I saw a similar error:

python train.py --batch_size=12 --compile=False

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 15.78 GiB total capacity; 14.54 GiB already allocated; 123.69 MiB free; 14.70 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
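
For a rough sense of why ~14.5 GiB is already allocated at the first training step, here is some back-of-envelope arithmetic, assuming the default GPT-2 124M model (n_layer=12, n_head=12, block_size=1024) and no flash attention (torch < 2.0); the exact numbers on a real run will differ:

    # Ballpark memory budget for nanoGPT's default model at batch_size=12.
    n_params = 124e6                       # ~124M parameters
    n_layer, n_head, block_size = 12, 12, 1024
    batch_size = 12

    # fp32 weights + fp32 grads + AdamW exp_avg/exp_avg_sq (both fp32)
    static = n_params * (4 + 4 + 8)
    print(f"weights + grads + optimizer state: ~{static / 2**30:.1f} GiB")  # ~1.8 GiB

    # Without flash attention, each layer materialises a (B, n_head, T, T)
    # attention matrix in half precision, and more than one such buffer is
    # alive around the softmax and its backward, so activations dominate.
    att = batch_size * n_head * block_size * block_size * 2
    print(f"one attention matrix: ~{att / 2**30:.2f} GiB per layer, "
          f"~{att * n_layer / 2**30:.1f} GiB over {n_layer} layers")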

Did you figure out how to resolve this?

aartivnkt avatar Mar 08 '23 15:03 aartivnkt

Follow-up: I upgraded torch to 2.0 and reduced the batch size further to 6. I can now run train.py on my dataset.
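
That fits with how nanoGPT's attention is written: when PyTorch 2.0's F.scaled_dot_product_attention is available, model.py takes the flash-attention path and never materialises the (T, T) score matrix, so the big activation term sketched above largely disappears. A simplified sketch of the two branches (not the actual model.py code):

    import torch
    import torch.nn.functional as F

    B, n_head, T, head_dim = 6, 12, 1024, 64
    q, k, v = (torch.randn(B, n_head, T, head_dim) for _ in range(3))

    if hasattr(F, 'scaled_dot_product_attention'):   # torch >= 2.0
        # fused kernel: no (B, n_head, T, T) attention matrix is ever allocated
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    else:
        # manual path: materialises the full attention matrix per layer
        att = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
        att = att.masked_fill(~mask, float('-inf'))
        y = F.softmax(att, dim=-1) @ v
    print(y.shape)   # torch.Size([6, 12, 1024, 64])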

aartivnkt avatar Mar 08 '23 20:03 aartivnkt