Retrieval-based-Voice-Conversion-WebUI

`CUDA out of memory` error when using multiple GPUs

Open mtyszczak opened this issue 1 year ago • 2 comments

With GPUs:

0	NVIDIA GeForce GTX 1660 SUPER
1	NVIDIA GeForce GTX 1660 SUPER
2	NVIDIA GeForce GTX 1660 SUPER
3	NVIDIA GeForce GTX 1660 SUPER
4	NVIDIA GeForce GTX 1660 SUPER
5	NVIDIA GeForce GTX 1660 SUPER

When training a model, the command `infer/modules/train/train.py -e "model1" -sr 40k -f0 1 -bs 15 -g 0-1-2-3-4-5 -te 100 -se 5 -pg assets/pretrained_v2/f0G40k.pth -pd assets/pretrained_v2/f0D40k.pth -l 0 -c 1 -sw 0 -v v2` produces:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacty of 5.80 GiB of which 5.75 MiB is free. Including non-PyTorch memory, this process has 5.78 GiB memory in use. Of the allocated memory 5.53 GiB is allocated by PyTorch, and 118.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
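For anyone reading the numbers in that message: a rough sketch that pulls the sizes out of the error text and compares them. The function names (`parse_oom`, `to_mib`) are made up for illustration and are not part of PyTorch or this repo; the message string is copied from the traceback above. On these figures the "reserved but unallocated" part is only about 2% of the card, so tuning `max_split_size_mb` looks unlikely to free much here, though that is an inference from the arithmetic, not something I have tested.

```python
import re

# Error text copied from the traceback (the "capacty" typo is in the original message).
MESSAGE = (
    "CUDA out of memory. Tried to allocate 20.00 MiB. "
    "GPU 0 has a total capacty of 5.80 GiB of which 5.75 MiB is free. "
    "Including non-PyTorch memory, this process has 5.78 GiB memory in use. "
    "Of the allocated memory 5.53 GiB is allocated by PyTorch, "
    "and 118.16 MiB is reserved by PyTorch but unallocated."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value: str, unit: str) -> float:
    """Convert a '5.80', 'GiB' style pair into MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(message: str) -> dict:
    """Hypothetical helper: extract the memory figures (in MiB) from the OOM text."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"total capac\w* of ([\d.]+) (MiB|GiB)",
        "free": r"of which ([\d.]+) (MiB|GiB) is free",
        "in_use": r"this process has ([\d.]+) (MiB|GiB) memory in use",
        "allocated": r"memory ([\d.]+) (MiB|GiB) is allocated by PyTorch",
        "reserved_unallocated": r"and ([\d.]+) (MiB|GiB) is reserved by PyTorch",
    }
    info = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{key}' in message")
        info[key] = to_mib(match.group(1), match.group(2))
    return info


info = parse_oom(MESSAGE)
# Share of the card held by live PyTorch allocations vs. cached-but-unused memory.
allocated_share = info["allocated"] / info["total"]
cached_share = info["reserved_unallocated"] / info["total"]
print(f"allocated by PyTorch: {allocated_share:.1%} of total")
print(f"reserved but unallocated: {cached_share:.1%} of total")
print(f"request fits in free memory: {info['requested'] <= info['free']}")
```

The request (20 MiB) is larger than what is free (5.75 MiB), and almost all of the card is held by live allocations, so GPU 0 is simply full at that point.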

Reducing the batch size to 1 with 6 GPUs on the board did not fix the problem.

Workaround: training on only one GPU with batch size set to 3 allowed me to train the model, but using multiple GPUs would be appreciated.
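To make the workaround concrete, here is a small sketch that rebuilds the argument list from the command above with a different `-g` and `-bs`. `build_train_args` is a hypothetical helper, not something shipped with the repo, and picking GPU index 0 for the single-GPU run is my assumption (the report does not say which card was used). It only builds and prints the arguments; it does not launch training.

```python
from typing import List


def build_train_args(gpu_ids: List[int], batch_size: int) -> List[str]:
    """Hypothetical helper: the reported command with -g and -bs swapped in."""
    if not gpu_ids:
        raise ValueError("at least one GPU index is required")
    if batch_size < 1:
        raise ValueError("batch size must be positive")
    # The -g flag in the report joins GPU indices with '-', e.g. 0-1-2-3-4-5.
    gpus = "-".join(str(i) for i in gpu_ids)
    return [
        "infer/modules/train/train.py",
        "-e", "model1",
        "-sr", "40k",
        "-f0", "1",
        "-bs", str(batch_size),
        "-g", gpus,
        "-te", "100",
        "-se", "5",
        "-pg", "assets/pretrained_v2/f0G40k.pth",
        "-pd", "assets/pretrained_v2/f0D40k.pth",
        "-l", "0",
        "-c", "1",
        "-sw", "0",
        "-v", "v2",
    ]


# Single GPU with batch size 3, as in the workaround (GPU 0 is an assumption).
print(" ".join(build_train_args([0], 3)))
```

All other flags are left exactly as in the failing command, so only the GPU list and batch size differ between runs.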

OS & drivers info:

Linux server 6.5.0-14-generic #14-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 14 14:59:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
VERSION="23.10 (Mantic Minotaur)"
pip 23.3.2 from /usr/local/lib/python3.8/site-packages/pip (python 3.8)
NVIDIA Driver Version: 535.146.02
NVML Version: 12.535.146.02

mtyszczak, Dec 23 '23 13:12

I have also encountered this issue, but in another project. The only workaround I found was to use 4 GPUs instead of 6.

I also had 6 GPUs, so when using multiple GPUs, try 4 GPUs (0-1-2-3), or try 5 GPUs (0-1-2-3-4).

haseebsultankhan, Dec 27 '23 05:12

Any possible fix?

Abedalhkeem-z, Mar 11 '24 15:03

This issue was closed because it has been inactive for 15 days since being marked as stale.

github-actions[bot], Apr 28 '24 04:04