text-generation-webui
Multiple GPU, horrendous speed
Describe the bug
Hi, I'm on Windows with a 4090 as the main GPU and a 3090 as a secondary one. These are the parameters I use to load a 30B model:

--xformers --notebook --model-menu --wbits 4 --model_type Llama --auto-devices --gpu-memory 0 23

The model seems to load fine on the second GPU, but the speed I get is horrendously slow, about 0.3 tokens/s. When I use only the main GPU with --gpu-memory 22 0 I get better speed, around 4 tokens/s, and when I try to use both of them with --gpu-memory 19 19 I get super slow speed again.
Also, when I try something that eats more VRAM, like contrastive search, it just goes out of memory instead of trying to use the free VRAM on either GPU.
Is there any config I missed to make it run fast on Windows, or is it inherently slow?
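For reference, the three launch commands I'm comparing look roughly like this (same `server.py` entry point as the Windows start script; only the `--gpu-memory` split changes):

```bat
:: Second GPU (3090) only -- loads fine but ~0.3 tokens/s
call python server.py --xformers --notebook --model-menu --wbits 4 --model_type Llama --auto-devices --gpu-memory 0 23

:: Main GPU (4090) only -- ~4 tokens/s
call python server.py --xformers --notebook --model-menu --wbits 4 --model_type Llama --auto-devices --gpu-memory 22 0

:: Split across both GPUs -- super slow again
call python server.py --xformers --notebook --model-menu --wbits 4 --model_type Llama --auto-devices --gpu-memory 19 19
```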
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Screenshot
No response
Logs
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 94.00 MiB (GPU 0; 23.99 GiB total capacity; 22.29 GiB already allocated; 0 bytes free; 23.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
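The log itself suggests trying `max_split_size_mb` to reduce allocator fragmentation. A minimal way to test that on Windows before launching, assuming the standard `PYTORCH_CUDA_ALLOC_CONF` environment variable (the 128 MiB value is just an example, not a project recommendation):

```bat
:: Hypothetical workaround for the fragmentation hint in the OOM message
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
call python server.py --xformers --notebook --model-menu --wbits 4 --model_type Llama --auto-devices --gpu-memory 19 19
```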
System Info
Windows 10
64 GB RAM
Ryzen 9 5950X
GPU 1: RTX 4090
GPU 2: RTX 3090
call python server.py --auto-devices --chat --sdp-attention --model-menu
Also, check your own log file.
What model are you running? WizardLM?
I'm using 30B models in 4-bit, but I think I'll retry on Linux, so I'll close this issue and see if it works better there.