text-generation-webui
4-bit offload to RAM
Is it theoretically possible to keep pre-quantized 4-bit LLaMA layers in RAM, to reduce RAM usage and improve I/O performance? Currently, offloading a 33B model to RAM requires 64 GB+ of RAM because the offloaded layers are stored at full precision; that could be significantly reduced by keeping the 4-bit layers in RAM and streaming them into the GPU only when needed. From my observation, the main bottleneck of model offloading is I/O throughput rather than computation, so transferring the smaller 4-bit weights should also speed things up. https://github.com/gmorenz/llama/tree/ssd could be useful as a reference when implementing this.
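For illustration, here is a minimal PyTorch sketch of the streaming idea: keep a layer's weights pinned in CPU RAM and copy them to the GPU only for the duration of that layer's forward pass. The `StreamedLayer` wrapper is hypothetical (not part of text-generation-webui or GPTQ-for-LLaMa), and a real 4-bit GPTQ layer would hold packed integer weights plus scales/zeros rather than plain fp16 parameters, which is exactly what shrinks the RAM footprint and the per-layer transfer size.

```python
import torch
import torch.nn as nn

class StreamedLayer(nn.Module):
    """Hypothetical wrapper: parameters live in pinned CPU RAM and are
    streamed to the GPU just-in-time for the forward pass."""

    def __init__(self, layer: nn.Module, device: str = "cuda"):
        super().__init__()
        self.device = device
        self.layer = layer.to("cpu")
        # Pin the CPU copies so host-to-device copies can run asynchronously.
        self.cpu_weights = {
            name: p.detach().clone().pin_memory()
            for name, p in self.layer.named_parameters()
        }

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        params = dict(self.layer.named_parameters())
        # Stream the weights in just before they are needed ...
        for name, cpu_w in self.cpu_weights.items():
            params[name].data = cpu_w.to(self.device, non_blocking=True)
        out = self.layer(x.to(self.device))
        # ... then point the parameters back at the pinned CPU copies so the
        # GPU buffers can be freed before the next layer is streamed in.
        for name, cpu_w in self.cpu_weights.items():
            params[name].data = cpu_w
        return out
```

With already-quantized 4-bit tensors stored this way, both the resident RAM and the per-layer PCIe transfer would be roughly a quarter of the fp16 case, which is where the I/O-bound speedup would come from.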