
Quanto Support for Fast LoRA loading and switching Functions

Open Metal079 opened this issue 1 year ago • 0 comments

🚀 The feature, motivation and pitch

I keep several models loaded in VRAM at once to speed up switching between models on my website, and I'd like to use Quanto (https://github.com/huggingface/optimum-quanto) to reduce VRAM usage. However, I found that the OneDiffX fast LoRA loading and switching functions are not compatible with models quantized using Quanto. Is this something that could be looked into?
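For reference, here is an untested sketch of the workflow I'm describing. It assumes a CUDA GPU with diffusers, optimum-quanto, and onediffx installed; the OneDiffX function names follow its README, and the model ID and LoRA path are placeholders:

```python
# Untested sketch; assumes diffusers, optimum-quanto, and onediffx
# are installed and a CUDA GPU is available.
import torch
from diffusers import StableDiffusionXLPipeline
from optimum.quanto import quantize, freeze, qint8
from onediffx.lora import load_and_fuse_lora, unfuse_lora

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # placeholder model
    torch_dtype=torch.float16,
).to("cuda")

# Quantize the UNet weights to int8 with Quanto to reduce VRAM usage.
quantize(pipe.unet, weights=qint8)
freeze(pipe.unet)

# OneDiffX's fast LoRA path; this is the call that fails on a
# Quanto-quantized UNet, presumably because the quantized modules are
# no longer plain nn.Linear/nn.Conv2d layers that OneDiffX can fuse into.
load_and_fuse_lora(pipe, "path/to/lora.safetensors", lora_scale=1.0)
# ... run inference ...
unfuse_lora(pipe)
```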

Alternatives

PEFT LoRA loading works, but it is much slower: a load that takes ~8 seconds with PEFT takes 2 seconds or less with OneDiffX.

Additional context

No response

Metal079 · Sep 10 '24 16:09