qlora

Inference Time 2X Slow

Open lukaswangbk opened this issue 2 years ago • 1 comment

Hi all,

When I run inference with the fine-tuned model on 2 GPUs, loaded in 4-bit, it is 2X slower than the original model after 4-bit quantization.

The model I used is MOSS, and the reason I use 2 GPUs for inference is an OOM issue. I wonder why this happens. I really hope you can help me out.

lukaswangbk • Jun 02 '23 04:06

@lukaswangbk did you find a solution?

zacharyblank • Jun 06 '23 22:06