
Multiple GPU, horrendous speed

Open Manimap opened this issue 1 year ago • 1 comments

Describe the bug

Hi, I'm on Windows with a 4090 as my main GPU and a 3090 as a secondary one. These are the parameters I use to load a 30B model: --xformers --notebook --model-menu --wbits 4 --model_type Llama --auto-devices --gpu-memory 0 23. The model seems to load fine on the second GPU, but the speed I get is horrendously slow, around 0.3 tokens/s. When I use only the main GPU with --gpu-memory 22 0, I get better speed, around 4 tokens/s. And when I try to split across both with --gpu-memory 19 19, the speed drops right back down.
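For context, my understanding is that --gpu-memory ends up as a per-device memory cap, roughly like the max_memory dict that Transformers/accelerate accept. A rough sketch of that idea only; the model path is a placeholder and the webui's actual 4-bit GPTQ loading path is different:

# Sketch of what I understand --gpu-memory 0 23 to mean: a per-device cap
# handed to accelerate/Transformers. Not the actual webui code.
from transformers import AutoModelForCausalLM

max_memory = {0: "0GiB", 1: "23GiB", "cpu": "64GiB"}  # GPU 0 gets nothing, GPU 1 gets 23 GiB

model = AutoModelForCausalLM.from_pretrained(
    "models/llama-30b",    # placeholder path
    device_map="auto",     # let accelerate spread layers across the allowed devices
    max_memory=max_memory,
)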

Also, when I try to use something that eats more VRAM, like contrastive search, it just goes out of memory instead of using the free VRAM on either GPU.

Is there any config I missed to make this run fast on Windows, or is it inherently slow?

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

Screenshot

No response

Logs

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 94.00 MiB (GPU 0; 23.99 GiB total capacity; 22.29 GiB already allocated; 0 bytes free; 23.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
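(The error message itself points at max_split_size_mb. If it's just fragmentation, one thing I could try is setting the allocator option before torch initializes CUDA; a sketch below, where the 512 MiB value is only an example, not a value from the webui docs.)

# Sketch: set the allocator option the OOM message mentions before torch touches CUDA.
# The 512 MiB split size is an arbitrary example value.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # import after the env var is set so the CUDA caching allocator picks it up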

System Info

Windows 10
64 GB RAM
5950X
GPU 1: 4090
GPU 2: 3090

Manimap avatar May 06 '23 17:05 Manimap

call python server.py --auto-devices --chat --sdp-attention --model-menu

Also, check your own log file.

What model are you running? WizardLM?

Tom-Neverwinter avatar May 08 '23 05:05 Tom-Neverwinter

I'm using 30B models in 4-bit, but I think I'll retry on Linux to see if it's any better, so I'll close this issue.

Manimap avatar May 11 '23 00:05 Manimap