
4-bit offload to RAM

Open Silver267 opened this issue 1 year ago • 0 comments

Is it theoretically possible to keep pre-quantized 4-bit LLaMA layers in RAM, to reduce RAM usage and improve I/O performance? Currently, offloading a 33B model to RAM requires 64 GB+ of RAM; that could be reduced significantly by storing the 4-bit layers in RAM and streaming them to the GPU only when needed. From my observations, the main bottleneck of model offloading is I/O performance rather than computation. https://github.com/gmorenz/llama/tree/ssd could be a useful reference when implementing this.
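
For illustration, here is a minimal PyTorch sketch of the idea (not code from text-generation-webui, GPTQ, or the linked repo): each layer's packed 4-bit weights stay in pinned CPU RAM and are copied to the GPU just-in-time for that layer's forward pass. The packing format, `StreamedQuantLinear`, and `dequantize_4bit` are hypothetical simplifications (symmetric 4-bit, per-row scales) meant only to show the streaming pattern; it also assumes a CUDA build, since pinning host memory requires one.

```python
import torch


def dequantize_4bit(packed: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Unpack two 4-bit values per byte and rescale (symmetric, zero-point 8)."""
    low = (packed & 0x0F).to(torch.float16)
    high = (packed >> 4).to(torch.float16)
    # (out, in//2) -> (out, in//2, 2) -> (out, in)
    vals = torch.stack((low, high), dim=-1).flatten(-2) - 8.0
    return vals * scales  # scales broadcast per output row


class StreamedQuantLinear(torch.nn.Module):
    """Linear layer whose packed 4-bit weights live in pinned CPU RAM
    and are streamed to the GPU only for the duration of the forward pass."""

    def __init__(self, packed_weight: torch.Tensor, scales: torch.Tensor):
        super().__init__()
        # Pinned memory allows asynchronous host-to-device copies.
        self.packed_weight = packed_weight.to(torch.uint8).pin_memory()
        self.scales = scales.to(torch.float16).pin_memory()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        device = x.device
        # Stream the compressed layer in, dequantize on the GPU, use it,
        # then let the temporary copies be freed.
        w_packed = self.packed_weight.to(device, non_blocking=True)
        scales = self.scales.to(device, non_blocking=True)
        w = dequantize_4bit(w_packed, scales)
        return torch.nn.functional.linear(x, w)


# Usage sketch: a 4096x4096 layer occupies ~8 MB of pinned RAM instead of ~32 MB in fp16.
packed = torch.randint(0, 256, (4096, 2048), dtype=torch.uint8)
scales = torch.full((4096, 1), 0.01, dtype=torch.float16)
layer = StreamedQuantLinear(packed, scales)
if torch.cuda.is_available():
    x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
    y = layer(x)  # weights are copied to the GPU only for this call
```

Streaming one compressed layer at a time like this moves roughly 4x less data per layer than fp16 offloading, which is exactly where the I/O bottleneck mentioned above would be relieved.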

Silver267 · Mar 11 '23, 18:03