
Suggestion: make it possible to offload quantization to disk instead of RAM

Open Nidvogr opened this issue 1 year ago • 3 comments

Quantizing larger models is impossible due to the high memory requirements, so I was thinking about the idea of building a method to offload memory to disk. Since relying on the OS's swap system brings absolutely everything to a halt, such a mechanism has to be managed manually.

Methods like HQQ are great, but only when the model fits in RAM, and the same goes for all the other methods. Since it's a general problem, I wonder whether a solution could somehow be generalized. Just an idea for discussion.
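For illustration, here is a rough sketch of the kind of manual, layer-by-layer offload I have in mind; `quantize_fn` and the per-layer weight files are hypothetical placeholders, not anything QLLM provides today:

```python
import gc
import os
import torch

def quantize_streaming(layer_paths, quantize_fn, out_dir):
    """Hypothetical sketch: quantize one layer at a time so only a
    single layer ever resides in RAM; results go straight to disk."""
    os.makedirs(out_dir, exist_ok=True)
    for i, path in enumerate(layer_paths):
        weights = torch.load(path, map_location="cpu")    # pull one layer off disk
        quantized = quantize_fn(weights)                  # e.g. an HQQ-style quantizer
        torch.save(quantized, f"{out_dir}/layer_{i}.pt")  # offload result back to disk
        del weights, quantized
        gc.collect()                                      # free RAM before the next layer
```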

Nidvogr · Apr 16 '24 17:04

Good question. We can definitely do this. The Transformers API AutoModelForCausalLM.from_pretrained has an argument named device_map that does what you mentioned: it supports loading the weights into CPU RAM, GPU VRAM, or disk separately.
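For reference, a minimal sketch of that API; the model name and the memory caps below are just placeholders, and `offload_folder` is where overflow weights land on disk:

```python
import torch
from transformers import AutoModelForCausalLM

# device_map="auto" lets accelerate split the weights across GPU, CPU RAM, and disk.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",                    # placeholder model
    device_map="auto",
    max_memory={0: "4GiB", "cpu": "8GiB"},  # cap GPU 0 and CPU RAM; the rest spills to disk
    offload_folder="offload",               # directory for disk-offloaded weights
    torch_dtype=torch.float16,
)
```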

It will take some work to refactor QLLM around this. Would you mind taking it on?

wejoncy · Apr 17 '24 03:04

Sure, I will take a look at it and see what I can do :-)

Nidvogr · Apr 20 '24 19:04

Thank you.

wejoncy · Apr 22 '24 01:04