
CUDA out of memory in CLI vicuna 7B


Running inference with Vicuna-7B on a 16 GB RTX 3080. Occasionally the script crashes with an error like:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 16.00 GiB total capacity; 13.69 GiB already allocated; 0 bytes free; 13.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

I modified modeling_llama.py by adding import os and os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:2000'; I also tried 'max_split_size_mb:4000' (see the note after this comment).

Any suggestions for addressing this issue? Thank you.

mpetruc · Apr 30 '23 15:04
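A note on the attempt above: PYTORCH_CUDA_ALLOC_CONF is read when PyTorch initializes its CUDA caching allocator, so setting it inside modeling_llama.py, after torch has been imported and allocations may already have happened, can be too late. As far as I understand, values as high as 2000 or 4000 also leave almost every block eligible for splitting, so behavior barely changes from the default; smaller values such as 128 or 512 are the usual anti-fragmentation settings. A minimal sketch of setting it early, assuming you control the Python entry point (the 128 value is an assumption to tune, not a recommendation from this thread):

# Set the allocator config before torch is imported, so it is in effect
# when the CUDA caching allocator initializes on the first allocation.
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'  # assumed value; tune per workload

import torch  # imported only after the env var is set

x = torch.empty(1, device='cuda')  # first CUDA allocation; config is applied here
print(torch.cuda.memory_allocated(0), 'bytes allocated')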

Have you tried --load-8bit? Even though the GPU has 16 GB of memory, you can only use around 85% of it (rough numbers after this comment).

Chesterguan · May 04 '23 21:05
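For rough context, the numbers in the original error line up with plain fp16 weights. A back-of-the-envelope sketch (assumptions: 2 bytes per parameter in fp16, 1 byte in int8; activations, KV cache, and CUDA context overhead all ignored):

# Rough weight-only memory estimate for a 7B-parameter model.
params = 7e9

fp16_gib = params * 2 / 1024**3  # ~13.0 GiB: close to the 13.69 GiB in the error above
int8_gib = params * 1 / 1024**3  # ~6.5 GiB: the headroom --load-8bit buys

print(f'fp16: {fp16_gib:.1f} GiB, int8: {int8_gib:.1f} GiB')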

I'm trying to run FastChat in a CUDA Docker image and I hit the same issue with an RTX 2070 (8 GB):

OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 7.78 GiB total capacity; 6.31 GiB already allocated; 62.44 MiB free; 6.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I tried running it with --load-8bit but get the same error. Command:

python3 -m fastchat.serve.cli --load-8bit --model-path /app/models/vicuna-7b

ivangabriele · May 12 '23 21:05
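Worth noting for the 8 GB case: even in int8, 7B weights come to roughly 6.5 GiB (see the estimate above), so an RTX 2070 is at the edge before activations and CUDA overhead are counted. Recent FastChat versions also document a --cpu-offloading flag that is used together with --load-8bit; the command below assumes that flag exists in your installed version, so check python3 -m fastchat.serve.cli --help first:

python3 -m fastchat.serve.cli --load-8bit --cpu-offloading --model-path /app/models/vicuna-7b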

@mpetruc this looks like some other process took over GPU memory. Did you check with nvidia-smi whether something else was holding it? (A Python snippet for the same check follows below.)

Is it still an issue for you?

surak · Oct 21 '23 16:10
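To complement the nvidia-smi check, the same free-versus-total view can be queried from Python. A minimal sketch using torch.cuda.mem_get_info, which is available in reasonably recent PyTorch releases:

# Free/total memory on GPU 0 as reported by the CUDA driver.
# If another process holds memory, 'free' will be well below 'total'.
import torch

free, total = torch.cuda.mem_get_info(0)
print(f'free: {free / 1024**3:.2f} GiB of {total / 1024**3:.2f} GiB total')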