ColossalAI-Examples

'RuntimeError: CUDA error: an illegal memory access was encountered' with large batch size of GPT2-example

Open · Gy-Lu opened this issue 3 years ago · 1 comment

🐛 Describe the bug

When I ran the gpt2-vanilla example with a batch size of 64, I hit RuntimeError: CUDA error: an illegal memory access was encountered. I then printed the GPU memory usage. At the second iteration the maximum allocated memory reached 74 GB (via torch.cuda.max_memory_allocated) and then the error occurred, while the currently allocated memory stayed below 50 GB (via torch.cuda.memory_allocated). The same thing happens with gpt2-zero3. I suspect the peak memory usage runs out of device memory even though the total allocated memory does not. This bug may be fixed by a future PyTorch update :)
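For reference, here is a minimal sketch of how per-iteration memory can be logged to reproduce the numbers quoted above; it is not from the original report, and `model`, `criterion`, `optimizer`, and `dataloader` are hypothetical placeholders for the GPT-2 training objects used in the example.

```python
import torch

for step, (input_ids, labels) in enumerate(dataloader):
    optimizer.zero_grad()
    loss = criterion(model(input_ids), labels)
    loss.backward()
    optimizer.step()

    # Currently allocated vs. peak allocated since the last reset, in GB.
    allocated = torch.cuda.memory_allocated() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"step {step}: allocated={allocated:.1f} GB, peak={peak:.1f} GB")

    # Reset the peak counter so each iteration reports its own maximum.
    torch.cuda.reset_peak_memory_stats()
```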

Environment

CUDA/11.3.1 NCCL/2.9.6 Python/3.8.12 PyTorch/1.10.1+cu113

Gy-Lu avatar Mar 30 '22 03:03 Gy-Lu

I have also tested with PYTORCH_NO_CUDA_MEMORY_CACHING=1 (disabling the caching allocator); the run still fails with the same error.
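A rough sketch of how that test can be reproduced, assuming the variable is set before PyTorch initializes CUDA (the more common way is to export it in the shell that launches the training script):

```python
import os

# Disable PyTorch's CUDA caching allocator; every allocation then goes
# straight to cudaMalloc. Must be set before the first CUDA allocation.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch  # imported after the variable is set

x = torch.randn(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated())
```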

Gy-Lu avatar Apr 21 '22 09:04 Gy-Lu