QLLM
Suggestion: make it possible to offload quantization to disk instead of RAM
Quantizing larger models is currently not feasible because of the high memory requirements, so I was thinking about building a method to offload memory to disk. Since relying on the OS swap system brings everything to a halt, such a mechanism has to be managed manually.
Methods like HQQ are great, but only when the model fits in RAM, and the same goes for all the other methods. Since it's a general problem, I wonder whether it can somehow be solved in a general way. Just an idea for discussion.
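To make the idea a bit more concrete, here is a rough sketch (not tied to QLLM's internals; the helper name and offload directory are made up) of what a manually managed, layer-by-layer disk offload could look like in PyTorch, where each linear layer's weight lives on disk and is streamed in just before that layer runs:

```python
# Sketch only: per-layer disk offload via forward hooks, so peak RAM stays
# at roughly one layer's worth of weights instead of the whole model.
import os
import torch
import torch.nn as nn

OFFLOAD_DIR = "offload"  # hypothetical scratch directory


def offload_linear_to_disk(module: nn.Linear, name: str) -> None:
    """Save the weight to disk and keep only an empty placeholder in RAM."""
    os.makedirs(OFFLOAD_DIR, exist_ok=True)
    torch.save(module.weight.data, os.path.join(OFFLOAD_DIR, f"{name}.pt"))
    module.weight.data = torch.empty(0)  # drop the in-RAM copy

    def load_before_forward(mod, inputs):
        # Stream the weight back from disk right before this layer runs.
        mod.weight.data = torch.load(os.path.join(OFFLOAD_DIR, f"{name}.pt"))

    def free_after_forward(mod, inputs, output):
        # Free it again once the layer has produced its output.
        mod.weight.data = torch.empty(0)

    module.register_forward_pre_hook(load_before_forward)
    module.register_forward_hook(free_after_forward)


if __name__ == "__main__":
    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
    for i, layer in enumerate(model):
        if isinstance(layer, nn.Linear):
            offload_linear_to_disk(layer, f"layer{i}")
    with torch.no_grad():
        out = model(torch.randn(1, 512))  # weights stream from disk per layer
    print(out.shape)
```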
Good question. We can definitely do this. The Transformers API AutoModelForCausalLM.from_pretrained has an argument named device_map that does what you mentioned. It supports loading the weights into VRAM, RAM, and disk separately.
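For reference, a rough sketch of that call; the model id and folder names below are just placeholders, not QLLM defaults:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",               # placeholder model id
    device_map="auto",                 # let Accelerate split layers across GPU/CPU/disk
    offload_folder="offload_weights",  # layers that don't fit in memory go here on disk
    offload_state_dict=True,           # also offload the state dict while loading
)
```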
It needs some work to refactor for this. Would you mind taking it on?
Sure, I will take a look at it and see what I can do :-)
Thank you.