
Can inference be done with CPU and if so, is it painfully slow?

Open flesler opened this issue 2 years ago • 2 comments

Assuming a not very large model, like 7b-13b, is it feasible to infer with CPU?

flesler avatar Jun 01 '23 16:06 flesler

4-bit seems to be around 2x slower than fp16 on GPU.

Oxi84 avatar Jun 01 '23 17:06 Oxi84

OK, the question is about CPU though, not GPU.

flesler avatar Jun 01 '23 17:06 flesler

QLoRA is primarily a memory-efficient method.

4-bit QLoRA training speed is roughly on par with 16-bit LoRA. However, it is true that inference is currently slow. We are working on improving 4-bit inference.

Regarding the original question, unfortunately, QLoRA only works on GPU.

artidoro avatar Jun 01 '23 19:06 artidoro

Thanks :+1:

flesler avatar Jun 01 '23 19:06 flesler

> QLoRA is primarily a memory-efficient method.
>
> 4-bit QLoRA training speed is roughly on par with 16-bit LoRA. However, it is true that inference is currently slow. We are working on improving 4-bit inference.
>
> Regarding the original question, unfortunately, QLoRA only works on GPU.

It is already impressive that it uses 8x less memory than fp32, so we can train a 30B model on a 24GB card.

For 4-bit inference, is it slow because you need to de-quantize some values to fp16 or fp32 for the neural network?

Oxi84 avatar Jun 02 '23 12:06 Oxi84
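For what it's worth, the "8x less memory than fp32" and "30B on a 24GB card" figures above can be checked with back-of-envelope arithmetic. A minimal sketch, counting weights only (ignoring activations, optimizer state, and the LoRA adapter itself, so the numbers are approximate):

```python
# Approximate weight memory for a model at a given precision.
# Weights only: activations, gradients, and optimizer state are ignored.
def weight_memory_gb(n_params_billion, bits_per_param):
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

for bits in (32, 16, 4):
    print(f"30B model at {bits}-bit: ~{weight_memory_gb(30, bits):.0f} GB")
# fp32 -> ~120 GB, fp16 -> ~60 GB, 4-bit -> ~15 GB
# 4-bit is 8x smaller than fp32, and ~15 GB of weights fits on a
# 24 GB card with room left over for activations and adapter state.
```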

> Assuming a not very large model, like 7b-13b, is it feasible to infer with CPU?

llama.cpp is probably the best option for CPU, once you train it via QLoRA and merge the LoRA weights into a full-size model. But I'd guess it would be less than 20 tokens/sec for a 7B model on a 13900K. An RTX 4090 (with AutoGPTQ) gets around 100 tokens/sec for 7B, maybe even more, and you can run multiple batches in one request to get a larger effective speed.

Oxi84 avatar Jun 02 '23 12:06 Oxi84
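The batching point above is worth a quick illustration: GPU decoding at small batch sizes is largely memory-bandwidth bound, so serving several sequences at once multiplies aggregate tokens/sec at only a modest per-stream cost. A minimal sketch with illustrative numbers (the 100 tokens/sec figure is from the comment above; the 10% batching overhead is an assumption):

```python
# Rough effective-throughput arithmetic for batched decoding.
single_stream_tps = 100   # e.g. 7B on a fast GPU, batch size 1
batch_size = 4
overhead = 0.9            # assumed ~10% per-stream slowdown when batching

effective_tps = single_stream_tps * batch_size * overhead
print(effective_tps)  # 360.0 tokens/sec aggregate across the batch
```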

> llama.cpp is probably the best option for CPU, once you train it via QLoRA and merge the LoRA weights into a full-size model.

@Oxi84 That sounds interesting, do you have a link where that's explained? Would the resulting model still be fp4? Otherwise what's the point of using QLoRA?

flesler avatar Jun 02 '23 13:06 flesler

I am also interested in how to merge a QLoRA adapter back into a full-sized model. I tried using `model.merge_and_unload()` but it didn't work when pushing to Hugging Face / saving using Torch.

Does anyone have such an example?
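Since the thread never shows what merging actually computes, here is a minimal NumPy sketch of the LoRA merge itself: the adapter stores a low-rank update `B @ A` next to the frozen base weight `W`, and merging folds `W + (alpha/r) * B @ A` into a single dense weight, which is what PEFT's `merge_and_unload` does layer by layer. Shapes, rank, and scaling below are illustrative assumptions. One common pitfall, as I understand it, is that the fold needs the base weights in fp16/fp32, so loading the base model with 4-bit quantization and then trying to merge tends to fail.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 4  # illustrative layer sizes and LoRA rank

W = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((r, d_in))      # LoRA down-projection
B = rng.standard_normal((d_out, r))     # LoRA up-projection
scale = alpha / r

# Merging folds the low-rank update into the base weight, so the
# adapter is no longer needed at inference time.
W_merged = W + scale * (B @ A)

# The merged weight reproduces base + adapter on any input:
x = rng.standard_normal(d_in)
assert np.allclose(W_merged @ x, W @ x + scale * (B @ (A @ x)))
```

After the fold, the model is a plain dense checkpoint, which is why it can then be handed to llama.cpp or re-quantized like any ordinary full-size model.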

tabacof avatar Jun 09 '23 14:06 tabacof

Hi all. Wondering why QLoRA only works on GPU but not on CPUs? @artidoro Thanks!

LeoPerelli avatar May 29 '24 15:05 LeoPerelli