Can inference be done with CPU and if so, is it painfully slow?
Assuming a not very large model, like 7b-13b, is it feasible to infer with CPU?
4-bit seems around 2x slower than fp16 on GPU.
Ok, the question is about CPU though, not GPU
QLoRA is primarily a memory efficient method.
4-bit QLoRA training speed is roughly on par with 16-bit LoRA. However, it is true that inference is currently slow. We are working on improving 4-bit inference.
Regarding the original question, unfortunately, QLoRA only works on GPU.
Thanks :+1:
It is already impressive that it uses 8x less memory than fp32, so we can train a 30B model on a 24GB card.
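The 8x figure checks out on a back-of-envelope basis for the weights alone (optimizer state and activations come on top of this):

```python
# Rough weight-memory estimate for a 30B-parameter model.
params = 30e9
fp32_gb = params * 4 / 2**30    # 4 bytes/param  -> ~112 GiB
nf4_gb = params * 0.5 / 2**30   # 0.5 bytes/param -> ~14 GiB

# 4-bit storage is 8x smaller than fp32 and fits a 24 GB card.
print(f"fp32: {fp32_gb:.1f} GiB, 4-bit: {nf4_gb:.1f} GiB")
```

The quantization block constants add a small overhead on top of the 0.5 bytes/param, so the real number is slightly higher, but it still fits comfortably in 24 GB.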
For 4-bit inference, is it slow because you need to de-quantize some values to fp16 or fp32 for the neural network?
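As a rough illustration of where that extra work comes from: the 4-bit weights have to be expanded back to floats before each matmul. This sketch uses simple linear int4 blockwise quantization, not the actual bitsandbytes NF4 codebook or its fused CUDA kernels, but the per-forward de-quantize step is the same idea:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_4bit(w, block=64):
    """Blockwise absmax quantization to 4-bit signed ints in [-7, 7]."""
    w = w.reshape(-1, block)
    absmax = np.abs(w).max(axis=1, keepdims=True)  # one fp scale per block
    q = np.round(w / absmax * 7).astype(np.int8)
    return q, absmax

def dequantize_4bit(q, absmax):
    """Recover float weights before the matmul; this per-forward step
    is overhead that plain fp16 weights do not pay."""
    return q.astype(np.float32) / 7 * absmax

w = rng.standard_normal((4, 64)).astype(np.float32)
q, absmax = quantize_4bit(w)
w_hat = dequantize_4bit(q, absmax).reshape(w.shape)
```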
> Assuming a not very large model, like 7b-13b, is it feasible to infer with CPU?
llama.cpp is probably the best option for CPU, once you train with QLoRA and merge the LoRA weights into a full-size model. But I'd guess less than 20 tokens/sec for 7B on a 13900K. An RTX 4090 (with AutoGPTQ) gets around 100 tokens/sec for 7B, maybe even more, and you can run multiple batches in one request for higher effective throughput.
> llama.cpp is probably the best option for CPU, once you train with QLoRA and merge the LoRA weights into a full-size model.
@Oxi84 that sounds interesting, do you have a link where that's explained? Would the resulting model still be fp4? Otherwise, what's the point of using QLoRA?
I am also interested in how to merge a QLoRA adapter back into a full-sized model. I tried using model.merge_and_unload, but it didn't work when pushing to the Hugging Face Hub or saving with Torch.
Does anyone have such an example?
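Not a full end-to-end recipe, but mathematically all `merge_and_unload` has to do is fold the low-rank update into the dense base weight. Assuming PEFT's usual convention (delta-W = scaling * B @ A with scaling = alpha / r; the shapes and values here are made up for illustration), a numpy sketch of the merge:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4
alpha = 8.0
scaling = alpha / r

W = rng.standard_normal((d, d)).astype(np.float32)  # base weight
A = rng.standard_normal((r, d)).astype(np.float32)  # lora_A
B = rng.standard_normal((d, r)).astype(np.float32)  # lora_B

# During training the adapter adds scaling * x @ A.T @ B.T to the output.
# Merging folds that update into a single dense matrix once:
W_merged = W + scaling * (B @ A)

x = rng.standard_normal((2, d)).astype(np.float32)
y_adapter = x @ W.T + scaling * (x @ A.T) @ B.T
y_merged = x @ W_merged.T
```

One common pitfall (an assumption on my part, not verified against your setup): this fold can't be applied to a base model that is still 4-bit quantized, so reloading the base model in fp16/bf16 before attaching the adapter and calling `merge_and_unload()` may be what's missing when saving or pushing fails.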
Hi all. Wondering why QLoRA only works for GPUs but not for CPUs? @artidoro Thanks!