CUDA out of memory

Open hanfluid opened this issue 1 year ago • 3 comments

I tried training on a cluster with multiple nodes. Steps to reproduce:

  1. Run "torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<Master node's IP> --master_port=1234 train.py --dataset=shakespeare --dtype=float16 --batch_size=2 --compile=False" on the master node.
  2. SSH into the second node.
  3. Run "torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=<Master node's IP> --master_port=1234 train.py --dataset=shakespeare --dtype=float16 --batch_size=2 --compile=False" there.
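
The two launch commands above differ only in --node_rank, so they can be generated from one place to avoid typos such as a missing space between flags. This is a minimal sketch: `torchrun_command` is a hypothetical helper (not part of nanoGPT), and `10.0.0.1` is a placeholder standing in for the master node's IP.

```python
import shlex


def torchrun_command(node_rank, master_addr, master_port=1234):
    """Build the launch command for one node.

    Hypothetical helper for illustration; it only assembles the flags
    listed in the steps above and does not run anything.
    """
    args = [
        "torchrun",
        "--nproc_per_node=2",
        "--nnodes=2",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        "train.py",
        "--dataset=shakespeare",
        "--dtype=float16",
        "--batch_size=2",
        "--compile=False",
    ]
    # shlex.join separates every argument with a single space
    return shlex.join(args)


if __name__ == "__main__":
    # 10.0.0.1 is a placeholder address, not the real master node IP
    for rank in (0, 1):
        print(torchrun_command(rank, "10.0.0.1"))
```

Run the printed line for rank 0 on the master node and the line for rank 1 on the second node.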

This fails with the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 394.00 MiB (GPU 0; 15.78 GiB total capacity; 5.17 GiB already allocated; 8.68 GiB free; 5.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
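
The error text suggests checking whether reserved memory is much larger than allocated memory. A rough sketch of that check, pulling the numbers straight out of the message, is below. `parse_oom_message` is a hypothetical helper, and it assumes the message format shown above with sizes in GiB; other PyTorch versions may word the message differently.

```python
import re

# Relevant part of the error message from this issue
MESSAGE = (
    "CUDA out of memory. Tried to allocate 394.00 MiB (GPU 0; "
    "15.78 GiB total capacity; 5.17 GiB already allocated; 8.68 GiB free; "
    "5.25 GiB reserved in total by PyTorch)"
)


def parse_oom_message(message):
    """Extract the GiB figures from an OOM message (hypothetical helper).

    Assumes the wording shown in MESSAGE; fields that are not found
    are simply left out of the result.
    """
    patterns = {
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1))
    return sizes


if __name__ == "__main__":
    sizes = parse_oom_message(MESSAGE)
    gap = sizes["reserved"] - sizes["allocated"]
    print(f"reserved - allocated = {gap:.2f} GiB")
    print(f"reported free        = {sizes['free']:.2f} GiB")
```

For this message, reserved (5.25 GiB) and allocated (5.17 GiB) are nearly equal, so the "reserved >> allocated" condition in the hint does not seem to hold here, even though 8.68 GiB is reported as free.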

hanfluid, Jan 25 '23 02:01