MedCPT

Retriever problem

Open · zsyggg opened this issue 1 year ago · 1 comment

When I ran the retriever code on four RTX 3090 GPUs, I got this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB. GPU 0 has a total capacty of 23.69 GiB of which 75.69 MiB is free. Including non-PyTorch memory, this process has 23.61 GiB memory in use. Of the allocated memory 22.85 GiB is allocated by PyTorch, and 383.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The original settings were per_gpu_train_batch_size=32 and gradient_accumulation_steps=1. After I changed them to 16 and 2, respectively, training worked, but I would still like to know why the original configuration runs out of memory. Even after switching the distributed training from DataParallel to DistributedDataParallel we still hit OOM, although it is possible we are doing that incorrectly.

zsyggg · May 24 '24 02:05
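
For reference, the suggestion in the error message about max_split_size_mb can be tried through the allocator configuration environment variable. A minimal sketch follows; the value 128 is only an illustration, not a setting recommended by the MedCPT authors, and the tensor allocation is a placeholder:

```python
# Hedged sketch: applying the allocator hint from the OOM message.
# The variable must be set before the first CUDA allocation; 128 MB is an
# illustrative split size, not a value tuned for this repository.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after setting the variable so the allocator picks it up

# Placeholder allocation just to show the allocator is now configured.
x = torch.randn(1024, 1024, device="cuda")
print(torch.cuda.memory_allocated() / 2**20, "MiB allocated")
```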

Thanks. 24 GB is enough for a batch size of 16; larger batch sizes need more GPU memory to run.

Andy-jqa · May 24 '24 02:05
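
To illustrate the trade-off described above, the sketch below shows how halving the per-GPU batch size while doubling gradient accumulation keeps the effective batch size at 32 while holding only 16 samples' activations on the GPU at a time. The flag names mirror the command in the question, but the model and loss are placeholders, not the repository's actual training loop:

```python
import torch

per_gpu_train_batch_size = 16    # was 32; 32 did not fit in 24 GB
gradient_accumulation_steps = 2  # was 1; 16 * 2 keeps the effective batch at 32

model = torch.nn.Linear(768, 768).cuda()            # placeholder encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(micro_batches):
    optimizer.zero_grad()
    for batch in micro_batches:
        loss = model(batch.cuda()).pow(2).mean()     # placeholder loss
        # Scale so the accumulated gradient matches one large batch of 32.
        (loss / gradient_accumulation_steps).backward()
    optimizer.step()

# Two micro-batches of 16 are processed sequentially, so only 16 samples'
# activations are resident on the GPU at any one time.
micro_batches = [torch.randn(per_gpu_train_batch_size, 768)
                 for _ in range(gradient_accumulation_steps)]
train_step(micro_batches)
```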