
CPU mode only uses one core, how to increase core count?

Destroyer opened this issue 1 year ago • 5 comments

I am trying to get llama running in CPU mode on a 24-core EPYC. Running with python3 server.py --model llama-13b --load-in-8bit --no-stream --cpu, a single response takes about 300 seconds for 200 tokens.

However, in this setup it only uses a single CPU core, and I don't see any argument to increase this. Can it be increased some other way, or am I hitting a limitation of Python?

Destroyer avatar Mar 12 '23 23:03 Destroyer

Try removing --load-in-8bit, this is not meant to be used in CPU mode.

oobabooga avatar Mar 12 '23 23:03 oobabooga

Unfortunately, removing this param doesn't change anything. It still pegs just one CPU core.

Before:
Output generated in 317.38 seconds (0.63 tokens/s, 200 tokens)
Output generated in 328.47 seconds (0.61 tokens/s, 200 tokens)
Output generated in 314.88 seconds (0.64 tokens/s, 200 tokens)

After:
Output generated in 316.73 seconds (0.63 tokens/s, 200 tokens)

Anything else I can try?

Destroyer avatar Mar 12 '23 23:03 Destroyer

It's weird that it's using just one core. Last year I used CPU mode a lot, and what I noticed is that PyTorch only uses 50% of the CPU cores. I have just run a test with llama-7b and the behavior was the same.

There is a workaround to force usage of all CPU cores, but it didn't lead to any performance improvement. If you search through the past issues, you can probably find something about it. A minimal sketch of that kind of workaround is below.
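For reference, the workaround usually amounts to raising PyTorch's CPU thread-pool sizes. This is a minimal sketch, assuming the 24-core machine mentioned above; the counts are placeholders, not project code:

```python
# Sketch: raise PyTorch's CPU thread pools to match the core count.
# The value 24 matches the EPYC mentioned above and is an assumption.
import torch

torch.set_num_threads(24)           # intra-op parallelism (e.g. matmuls)
torch.set_num_interop_threads(24)   # inter-op parallelism; call early,
                                    # before any parallel work starts
print(torch.get_num_threads())      # verify the setting took effect
```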

What is your OS?

oobabooga avatar Mar 13 '23 00:03 oobabooga

Nothing exotic - Debian 11.6

Destroyer avatar Mar 13 '23 00:03 Destroyer

I did try adding torch.set_num_threads(24) to modules/text_generation.py, as suggested in https://github.com/oobabooga/text-generation-webui/issues/8, but I am still seeing the same thing.
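One thing that may explain this: OpenMP/MKL thread counts are read when torch is first imported, so setting them inside modules/text_generation.py can be too late. A hedged sketch of setting them before the import instead (the counts are assumptions, and this is one possibility, not a confirmed fix):

```python
# Sketch: set OpenMP/MKL thread counts BEFORE importing torch, since
# they are read at import/initialization time. Equivalent to exporting
# them in the shell before launching server.py.
import os
os.environ["OMP_NUM_THREADS"] = "24"  # assumption: 24 cores
os.environ["MKL_NUM_THREADS"] = "24"

import torch
torch.set_num_threads(24)
print(torch.get_num_threads())
```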

Destroyer avatar Mar 13 '23 00:03 Destroyer

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

github-actions[bot] avatar Apr 13 '23 16:04 github-actions[bot]

Have there been any updates? I am running into the same issue on Arch. I have a server with 32 cores, but only 12 cores are being used with the model deepseek-coder-6.7b-instruct.Q4_K_M.gguf.
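For a .gguf model the thread count is controlled by the llama.cpp loader rather than by PyTorch; if I recall correctly, the webui exposes a threads setting for that loader. Under the hood this corresponds to llama-cpp-python's n_threads parameter, roughly like this sketch (the path and thread count are assumptions for illustration):

```python
# Sketch: how a GGUF model's CPU thread count maps to llama-cpp-python,
# the backend used for .gguf files. Path and n_threads are assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="models/deepseek-coder-6.7b-instruct.Q4_K_M.gguf",
    n_threads=32,  # CPU threads used for generation
)
out = llm("Write a hello-world in Python.", max_tokens=64)
print(out["choices"][0]["text"])
```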

hadadadebadada avatar Jan 22 '24 13:01 hadadadebadada

I am using ollama instead, which does a much better job at CPU utilization. However, it sometimes gets stuck and never produces an output, so it's a tradeoff.

Destroyer avatar Apr 21 '24 00:04 Destroyer