
Possible Bug with setting number of threads when splitting between CPU and GPU? --threads does not work?

Open viperwasp opened this issue 1 year ago • 2 comments

Describe the bug

I think my 12700K is stuck on 2 threads or something. CPU usage is at most 22-25% while generating. I am running a 30B Alpaca model and it works, but very slowly. Specifically, I am running the alpaca-30b-lora-int4 model, file "alpaca-30b-4bit-128g.safetensors".

Here is my command line: "python server.py --auto-devices --chat --pre_layer 36 --model alpaca-30b-lora-int4 --wbits 4 --groupsize 128 --model_type llama --threads 16"

So I am splitting between GPU and CPU. Is it working as intended, or is there something I am doing wrong? I want it to run on 16 threads; I know I have 20. I also tried setting threads to 20, and CPU usage is still only about 18% while generating tokens. Thanks.
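Rough math on those numbers, for scale: on a 20-thread CPU, the overall percentage times the thread count gives roughly how many threads are actually busy.

```python
# Overall CPU % on a 20-thread 12700K converted to busy-thread counts.
threads_total = 20
for overall_pct in (18, 22, 25):
    busy = overall_pct / 100 * threads_total
    print(f"{overall_pct}% overall ≈ {busy:.1f} threads busy")
```

So 18-25% overall is only about 3.5 to 5 threads doing work, nowhere near 16.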

If this is not a bug and the behavior is intended, please close this inquiry. Thank you.

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

It happens every time. At least one other person is having the same issue, as seen here: https://github.com/oobabooga/text-generation-webui/discussions/1589

Screenshot

No response

Logs

I don't think there are logs I can provide for this?

System Info

Windows 10 Home
i7-12700K
RTX 4080 16GB
64GB DDR4 Memory 
M.2 Gen 4 SSD 
Running locally in Ooba. Ooba fully updated using "git pull https://github.com/oobabooga/text-generation-webui.git" 
I will say that I am using an older install of Ooba that uses mamba rather than conda. If that could be the issue, let me know.

viperwasp avatar Apr 30 '23 17:04 viperwasp

The --threads option says it only pertains to llama.cpp, and as best I know that runs only the GGML CPU-only models. In that mode I've seen it use all 16 of my cores, though going beyond four or so doesn't seem to increase tokens/sec much.
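For anyone curious where the flag actually lands, here is a rough sketch of the llama-cpp-python call it corresponds to. This is an illustration rather than the webui's actual loader code, and the model path is a placeholder:

```python
# Illustration only: --threads is forwarded to llama.cpp's thread count,
# so it has an effect only when a GGML model is loaded this way.
from llama_cpp import Llama

llm = Llama(model_path="ggml-model-q4_0.bin",  # placeholder path
            n_threads=16)                      # what --threads controls
out = llm("Hello,", max_tokens=16)
print(out["choices"][0]["text"])
```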

I don't know of a way to make the CPU portion of an offloaded GPU model like the one you're testing run multi-threaded, nor what would be involved in doing that.
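If anyone wants to experiment, my guess (untested) is that the CPU portion of a --pre_layer split runs through PyTorch, whose CPU thread count is configured separately from --threads:

```python
# Untested assumption: PyTorch's intra-op thread pool handles the CPU
# layers of an offloaded GPTQ model, and its size is set like this.
import torch

torch.set_num_threads(16)       # threads used for CPU tensor ops
print(torch.get_num_threads())  # confirm the setting took
```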

I typically see 250% CPU while loading a model, then a steady 103-106% during inference, with GPU usage varying from 0 to 100%. Are the 18-25% CPU numbers you mention percentages of all available cores, or of just one?
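If you have psutil installed, a quick check like this shows the per-core picture while the model is generating:

```python
# Sample each logical core for one second, then print per-core load
# and the overall average for comparison.
import psutil

per_core = psutil.cpu_percent(interval=1, percpu=True)
print(per_core)                       # one percentage per logical core
print(sum(per_core) / len(per_core))  # overall average across cores
```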

dblacknc avatar Apr 30 '23 21:04 dblacknc

Percentages of overall usage. In reality I think two cores, or rather threads, are at 100%. It would be ideal to allow more threads, but if that's a technical impossibility then so be it. I don't understand this stuff too much.

viperwasp avatar Apr 30 '23 21:04 viperwasp

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

github-actions[bot] avatar May 30 '23 23:05 github-actions[bot]

I'm running a GGML model on CPU only and I have the exact same problem. CPU usage gets stuck at 20% and nothing is generated, and on the performance tab of Task Manager it looks like only two threads are half active. Running the same model with KoboldCPP works fine.
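One way to narrow it down is to load the same GGML file directly through llama-cpp-python with an explicit thread count (the path below is a placeholder). If that saturates the CPU the way KoboldCPP does, the problem is in how the webui passes --threads through rather than in llama.cpp itself:

```python
# Isolation test with a placeholder model path: bypass the webui and
# set the llama.cpp thread count explicitly.
from llama_cpp import Llama

llm = Llama(model_path="/path/to/model.ggml.bin", n_threads=8)
print(llm("Test prompt:", max_tokens=32)["choices"][0]["text"])
```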

DavidRL77 avatar Jul 05 '23 20:07 DavidRL77