text-generation-webui
Possible Bug with setting number of threads when splitting between CPU and GPU? --threads does not work?
Describe the bug
I think my 12700K is stuck on 2 threads or something. CPU usage tops out at around 22-25%. I am running a 30B Alpaca model and it works, but very slowly. The model is alpaca-30b-lora-int4, file "alpaca-30b-4bit-128g.safetensors".
Here is my command line "python server.py --auto-devices --chat --pre_layer 36 --model alpaca-30b-lora-int4 --wbits 4 --groupsize 128 --model_type llama --threads 16"
So I am splitting between GPU and CPU. Is this working as intended, or is there something I am doing wrong? I want it to run on 16 threads, and I know I have 20 threads available. Thanks. I also just set threads to 20 and CPU usage is still only about 18% while generating tokens.
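For reference, a quick diagnostic (standard library and PyTorch calls only, nothing specific to the webui) to compare what the OS reports against what PyTorch is configured to use for CPU ops:

```python
# Quick diagnostic: compare the machine's logical CPU count with the number
# of threads PyTorch is configured to use for CPU tensor ops.
import os
import torch

print("logical CPUs reported by the OS:", os.cpu_count())
print("PyTorch intra-op threads:", torch.get_num_threads())
print("PyTorch inter-op threads:", torch.get_num_interop_threads())
```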
If this is not a bug and is intended please close this inquiry. Thank you.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
It's happening every time. At least one other person is having the same issue as seen here. https://github.com/oobabooga/text-generation-webui/discussions/1589
Screenshot
No response
Logs
I don't think there are logs I can provide for this?
System Info
Windows 10 Home
I7 12700K
RTX 4080 16GB
64GB DDR4 Memory
M.2 Gen 4 SSD
Running locally in Ooba. Ooba fully updated using "git pull https://github.com/oobabooga/text-generation-webui.git"
I will say that I am using an older install of Ooba that uses mamba and not conda. If that could be the issue, let me know.
The --threads option says it only pertains to llama.cpp, and as best I know that loader only runs the GGML CPU-only models. In that mode I've seen it use all 16 of my cores, though it seems like more than four or so doesn't increase the tokens/sec much.
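For comparison, this is roughly how a thread count gets passed when loading a GGML model through the llama-cpp-python bindings (a minimal sketch; the model path and thread count are made up for illustration, and this is not the webui's exact code):

```python
# Minimal sketch of loading a GGML model with an explicit thread count via
# llama-cpp-python. The model path is hypothetical; n_threads is the knob
# that the --threads flag is meant to control for this loader.
from llama_cpp import Llama

llm = Llama(
    model_path="models/example-ggml-model-q4_0.bin",  # hypothetical path
    n_threads=16,                                      # CPU threads used for generation
)

output = llm("Write one sentence about llamas.", max_tokens=32)
print(output["choices"][0]["text"])
```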
I don't know of a way to make CPU offloading of GPU models like the one you're testing multi-thread on the CPU portion, nor what would be involved in doing that.
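The only thing I can think to try, and this is just a guess, is raising PyTorch's own thread settings before generation and seeing whether the offloaded layers pick it up (torch.set_num_threads is a standard PyTorch call; whether it actually helps with GPTQ offloading is untested):

```python
# Untested experiment: raise PyTorch's intra-op parallelism before generating.
# torch.set_num_threads controls how many threads PyTorch uses for CPU tensor
# ops; whether the CPU-offloaded GPTQ layers actually benefit is unverified.
import torch

torch.set_num_threads(16)          # ask for 16 CPU threads for tensor ops
torch.set_num_interop_threads(4)   # inter-op parallelism; must be set before any ops run
```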
I typically see around 250% CPU when loading a model, then a steady 103-106% during inference, with GPU usage varying from 0 to 100%. Are those 18-25% CPU numbers you mention percentages of all available cores, or of just one?
Percentages of the overall CPU. In reality I think 2 cores, or more likely threads, are at 100%. It would be ideal to allow more threads, but if it's a technical impossibility then so be it. I don't understand this stuff too much.
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
I'm running a GGML model on CPU only and I have the exact same problem. CPU usage gets stuck at 20% and it doesn't generate anything, and on the performance tab of Task Manager it looks like only two threads are half active. Running it with KoboldCPP works fine.
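In case it helps narrow this down, here is a quick way to see per-core load while the model is generating (psutil is an ordinary Python package, not part of the webui; this is just a diagnostic sketch):

```python
# Diagnostic sketch: sample per-core CPU usage for a few seconds while the
# model is generating, to count how many cores are actually doing work.
import psutil

for _ in range(5):
    per_core = psutil.cpu_percent(interval=1.0, percpu=True)
    busy = sum(1 for p in per_core if p > 50)
    print(f"cores above 50% load: {busy} / {len(per_core)}")
```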