text-generation-webui
4-bit offload to RAM
Is it theoretically possible to keep pre-quantized 4-bit LLaMA layers in RAM, to reduce RAM usage and improve I/O performance? Currently, offloading a 33B model to RAM requires 64 GB+ of RAM because the offloaded layers are stored at full precision; that could be significantly reduced by keeping the 4-bit layers in RAM and streaming them into the GPU only when needed. From my observation, the main bottleneck of model offloading is I/O throughput rather than computation, so transferring the smaller 4-bit weights should also speed things up. https://github.com/gmorenz/llama/tree/ssd could be useful as a reference when implementing this.
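For illustration, here is a minimal PyTorch sketch of the streaming idea: keep a layer's weights pinned in CPU RAM and copy them to the GPU only for the duration of that layer's forward pass. The `StreamedLayer` wrapper is hypothetical (not part of text-generation-webui or GPTQ-for-LLaMa), and a real 4-bit GPTQ layer would hold packed integer weights plus scales/zeros rather than plain fp16 parameters, which is exactly what shrinks the RAM footprint and the per-layer transfer size.

```python
import torch
import torch.nn as nn

class StreamedLayer(nn.Module):
    """Hypothetical wrapper: parameters live in pinned CPU RAM and are
    streamed to the GPU just-in-time for the forward pass."""

    def __init__(self, layer: nn.Module, device: str = "cuda"):
        super().__init__()
        self.device = device
        self.layer = layer.to("cpu")
        # Pin the CPU copies so host-to-device copies can run asynchronously.
        self.cpu_weights = {
            name: p.detach().clone().pin_memory()
            for name, p in self.layer.named_parameters()
        }

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        params = dict(self.layer.named_parameters())
        # Stream the weights in just before they are needed ...
        for name, cpu_w in self.cpu_weights.items():
            params[name].data = cpu_w.to(self.device, non_blocking=True)
        out = self.layer(x.to(self.device))
        # ... then point the parameters back at the pinned CPU copies so the
        # GPU buffers can be freed before the next layer is streamed in.
        for name, cpu_w in self.cpu_weights.items():
            params[name].data = cpu_w
        return out
```

With already-quantized 4-bit tensors stored this way, both the resident RAM and the per-layer PCIe transfer would be roughly a quarter of the fp16 case, which is where the I/O-bound speedup would come from.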