text-generation-webui
Support for CPU-Only Systems with limited RAM
Currently, it seems like the app always uses FP32, requiring a lot of memory when loading models on CPU. Also, the `--cpu-memory` argument seems to get ignored, and `--disk` doesn't work either.
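For reference, here is roughly how I would expect that kind of offload to be wired up with transformers + accelerate. This is a sketch, not the webui's actual code; the model name, memory cap, and offload folder are placeholders:

```python
# Hypothetical sketch: cap CPU RAM usage and spill remaining weights to disk
# with transformers + accelerate. Values below are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # placeholder model

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",              # let accelerate decide weight placement
    max_memory={"cpu": "24GiB"},    # rough analogue of --cpu-memory
    offload_folder="offload",       # rough analogue of --disk
    torch_dtype=torch.float32,      # CPU default; see precision discussion below
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```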
llama-7B is the biggest model I could use on a server with 32GB allocated. I'm not sure whether 8-bit or even fp16 is supported on CPU, but maybe at least the disk offload could be fixed?
Which models are you using? The llama models appear to work only on GPU for me, despite having 48GB of RAM
RWKV proves that fp32i8 (fp32 compute with 8-bit quantized weights) is possible on CPU.
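If I remember the `rwkv` pip package correctly, this mode is selected through its strategy string. A rough sketch, with the checkpoint path as a placeholder:

```python
# Sketch of RWKV's quantized-on-CPU mode via its strategy string.
# The checkpoint path is a placeholder; point it at a real .pth file.
from rwkv.model import RWKV

model = RWKV(
    model="RWKV-4-Pile-7B-20230109-ctx4096.pth",  # placeholder checkpoint
    strategy="cpu fp32i8",  # fp32 compute, int8-quantized weights, on CPU
)
```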
About precision: pytorch can only do fp32 in CPU mode (as far as I am aware).
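As a quick sanity check of the storage difference: half-precision tensors can at least be stored on CPU even if many CPU ops still require fp32. This only measures tensor size, not kernel support:

```python
# Compare in-memory size of the same weights in fp32 vs fp16 on CPU.
import torch

w32 = torch.randn(4096, 4096, dtype=torch.float32)
w16 = w32.to(torch.float16)  # storage works on CPU; op coverage is more limited

print(w32.element_size() * w32.nelement() / 2**20, "MiB")  # ~64 MiB
print(w16.element_size() * w16.nelement() / 2**20, "MiB")  # ~32 MiB
```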
I ran 4-bit quantized 7B with llama.cpp and the whole system was using just a bit over 16 GB of RAM, but it was ultra slow.
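For anyone wanting to reproduce something similar from Python, here is a sketch using the llama-cpp-python bindings rather than the llama.cpp CLI itself; the model path and thread count are placeholders:

```python
# Sketch: running a 4-bit quantized llama model on CPU via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/ggml-model-q4_0.bin",  # placeholder path to a quantized model
    n_threads=8,                                   # tune for your CPU
)

out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```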
This issue has been closed after 30 days of inactivity. If you believe it is still relevant, please leave a comment below.