
Support for CPU-Only Systems with Limited RAM

uff-valeu opened this issue 1 year ago • 4 comments

Currently, the app seems to always load models in FP32, which requires a lot of memory when running on CPU. The --cpu-memory argument also appears to be ignored, and --disk doesn't work either.

llama-7B is the biggest model I could use on a server with 32 GB allocated. I'm not sure whether 8-bit or even fp16 is supported on CPU, but maybe at least the disk offload could be fixed?
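
For context, disk offload in the underlying Hugging Face stack is normally configured through accelerate's device_map machinery. A minimal sketch of how that is usually wired up (the model name, memory cap, and offload folder below are placeholders, and this is not necessarily the webui's actual code path):

```python
# Sketch: CPU loading with a RAM cap and disk offload via accelerate.
# Model name, memory limit, and folder are placeholders for illustration.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",           # placeholder model
    device_map="auto",             # let accelerate place the layers
    max_memory={"cpu": "16GiB"},   # cap how much RAM the weights may use
    offload_folder="offload",      # spill whatever doesn't fit onto disk
)
```

If --cpu-memory and --disk were honored, one would expect them to translate into roughly the max_memory and offload_folder arguments above.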

uff-valeu avatar Mar 13 '23 08:03 uff-valeu

Which models are you using? The llama models appear to work only on GPU for me, despite having 48GB of RAM

RazeLighter777 avatar Mar 13 '23 11:03 RazeLighter777

RWKV proves there is fp32i8 (int8-quantized weights with fp32 compute) on CPU.
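
For reference, the rwkv pip package exposes this as a strategy string. A minimal sketch, assuming a downloaded RWKV-4 checkpoint at a placeholder path:

```python
# Sketch: loading an RWKV model on CPU with int8-quantized weights.
# The checkpoint path is a placeholder for a locally downloaded .pth file.
from rwkv.model import RWKV

model = RWKV(
    model="models/RWKV-4-Pile-430M-20220808-8066",  # placeholder path
    strategy="cpu fp32i8",  # fp32 activations, int8 weights, CPU only
)
```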

Ph0rk0z avatar Mar 13 '23 12:03 Ph0rk0z

About precision: PyTorch can only do fp32 in CPU mode (as far as I am aware).
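
This can be probed directly; results vary by PyTorch version, and fp16/bf16 coverage on CPU has been expanding, so the following is just a quick check rather than a definitive statement:

```python
# Quick probe of reduced-precision matmul support on CPU.
import torch

x = torch.randn(64, 64)

# fp16 matmul on CPU raised "not implemented for 'Half'" in many
# PyTorch versions around early 2023; newer releases may support it.
try:
    _ = x.half() @ x.half()
    print("fp16 CPU matmul: supported")
except RuntimeError as e:
    print("fp16 CPU matmul unsupported:", e)

# bfloat16 has generally had broader CPU operator coverage.
print("bf16 CPU matmul dtype:", (x.bfloat16() @ x.bfloat16()).dtype)
```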

oobabooga avatar Mar 13 '23 18:03 oobabooga

I ran a 4-bit quantized 7B model with llama.cpp and the whole system used just over 16 GB of RAM, but it was ultra slow.
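
For anyone wanting to reproduce this from Python rather than the llama.cpp CLI, a minimal sketch using the llama-cpp-python bindings (the model path is a placeholder for a locally converted 4-bit quantized file, and this is not exactly what the commenter ran):

```python
# Sketch: CPU inference on a 4-bit quantized LLaMA 7B via llama-cpp-python
# (pip install llama-cpp-python). The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="models/7B/ggml-model-q4_0.bin", n_threads=8)
out = llm("Building a website can be done in 10 simple steps:", max_tokens=64)
print(out["choices"][0]["text"])
```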

Alcyon6 avatar Mar 16 '23 18:03 Alcyon6

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

github-actions[bot] avatar Apr 15 '23 23:04 github-actions[bot]