
Multiple GPU inference

Open · Zheng392 opened this issue · 1 comment

I'm running inference with a Llama 70B model on 4× 16 GB V100 GPUs, using model.generate() to produce output. But I found that only one GPU is fully utilized at any given time. Since the 70B model needs at least 40 GB of VRAM just to load, I can't run data parallelism (one full copy of the model per GPU). How can I fully utilize all 4 GPUs to increase speed?
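For reference, a minimal sketch of this kind of setup, assuming transformers with bitsandbytes 4-bit quantization; the checkpoint name and quantization settings here are placeholders, since the issue does not show the actual loading code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-70b-hf"  # hypothetical; substitute your checkpoint

# 4-bit quantization so a 70B model fits in 4x16 GB total VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # accelerate shards the layers across all visible GPUs
)

# Inputs go to the first device, where the embedding layer usually lives
inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```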


Zheng392 · Aug 09 '23, 20:08

The way "accelerate" works here is naive model parallelism: it puts different network layers on different GPUs. When you feed in your data, the activations are processed layer by layer, GPU by GPU, so at any given moment only one GPU is doing compute while the others sit idle waiting for the activations to reach them. That is why you see a single GPU fully utilized at a time.
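A quick way to see this layer sharding, assuming the model was loaded with device_map="auto" as in the sketch above (hf_device_map is the attribute transformers populates in that case):

```python
from collections import Counter

# `model` is the quantized model loaded with device_map="auto" above.
# Each entry maps a module name to the GPU index it was placed on.
for name, device in list(model.hf_device_map.items())[:5]:
    print(name, "->", device)

# Count how many modules sit on each GPU. The forward pass visits them
# in order, so only one GPU computes at any given moment.
print(Counter(model.hf_device_map.values()))
```

With this layout, adding GPUs increases capacity, not speed; higher utilization for a single model copy comes from batching more sequences into each generate() call, so every GPU has work queued as activations flow through it.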

phalexo · Aug 18 '23, 19:08