text-generation-webui
Support for CPU-Only Systems with limited RAM
Currently, it seems like the app always uses FP32, requiring a lot of memory when loading models on CPU. Also, the `--cpu-memory` argument seems to get ignored, and `--disk` doesn't work either.
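For reference, here is roughly how I would expect that kind of offload to be wired up with transformers + accelerate. This is a sketch, not the webui's actual code; the model name, memory cap, and offload folder are placeholders:

```python
# Hypothetical sketch: cap CPU RAM usage and spill remaining weights to disk
# with transformers + accelerate. Values below are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # placeholder model

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",              # let accelerate decide weight placement
    max_memory={"cpu": "24GiB"},    # rough analogue of --cpu-memory
    offload_folder="offload",       # rough analogue of --disk
    torch_dtype=torch.float32,      # CPU default; see precision discussion below
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```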
llama-7B is the biggest model I could use on a server with 32GB allocated. I'm not sure whether 8-bit or even fp16 is supported on CPU, but maybe at least the disk offload could be fixed?
Which models are you using? The llama models appear to work only on GPU for me, despite having 48GB of RAM
RWKV proves that fp32i8 (fp32 compute with 8-bit quantized weights) is possible on CPU.
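If I remember the `rwkv` pip package correctly, this mode is selected through its strategy string. A rough sketch, with the checkpoint path as a placeholder:

```python
# Sketch of RWKV's quantized-on-CPU mode via its strategy string.
# The checkpoint path is a placeholder; point it at a real .pth file.
from rwkv.model import RWKV

model = RWKV(
    model="RWKV-4-Pile-7B-20230109-ctx4096.pth",  # placeholder checkpoint
    strategy="cpu fp32i8",  # fp32 compute, int8-quantized weights, on CPU
)
```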
About precision: pytorch can only do fp32 in CPU mode (as far as I am aware).
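As a quick sanity check of the storage difference: half-precision tensors can at least be stored on CPU even if many CPU ops still require fp32. This only measures tensor size, not kernel support:

```python
# Compare in-memory size of the same weights in fp32 vs fp16 on CPU.
import torch

w32 = torch.randn(4096, 4096, dtype=torch.float32)
w16 = w32.to(torch.float16)  # storage works on CPU; op coverage is more limited

print(w32.element_size() * w32.nelement() / 2**20, "MiB")  # ~64 MiB
print(w16.element_size() * w16.nelement() / 2**20, "MiB")  # ~32 MiB
```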
I ran 4-bit quantized 7B with llama.cpp and the whole system was using just a bit over 16 GB of RAM, but it was ultra slow.
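For anyone wanting to reproduce something similar from Python, here is a sketch using the llama-cpp-python bindings rather than the llama.cpp CLI itself; the model path and thread count are placeholders:

```python
# Sketch: running a 4-bit quantized llama model on CPU via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/ggml-model-q4_0.bin",  # placeholder path to a quantized model
    n_threads=8,                                   # tune for your CPU
)

out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```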
This issue has been closed after 30 days of inactivity. If you believe it is still relevant, please leave a comment below.