
Multiple GPU inference

Open · Zheng392 opened this issue · 1 comment

I'm running inference with a Llama 70B model on 4× 16 GB V100 GPUs, using model.generate() to produce output. But I found that only one GPU is fully utilized at any given time. Since the 70B model needs at least 40 GB of VRAM just to load, I can't run data parallelism (one full copy of the model per GPU). How can I fully utilize all 4 GPUs to increase speed?
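For reference, a minimal sketch of this kind of setup, assuming transformers with bitsandbytes 4-bit quantization; the checkpoint name and quantization settings here are placeholders, since the issue does not show the actual loading code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-70b-hf"  # hypothetical; substitute your checkpoint

# 4-bit quantization so a 70B model fits in 4x16 GB total VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # accelerate shards the layers across all visible GPUs
)

# Inputs go to the first device, where the embedding layer usually lives
inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```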


Zheng392 · Aug 09 '23, 20:08

The way "accelerate" works here is naive model parallelism: it puts different network layers on different GPUs. When you feed in your data, the activations are processed layer by layer, GPU by GPU, so at any given moment only one GPU is doing compute while the others sit idle waiting for the activations to reach them. That is why you see a single GPU fully utilized at a time.
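A quick way to see this layer sharding, assuming the model was loaded with device_map="auto" as in the sketch above (hf_device_map is the attribute transformers populates in that case):

```python
from collections import Counter

# `model` is the quantized model loaded with device_map="auto" above.
# Each entry maps a module name to the GPU index it was placed on.
for name, device in list(model.hf_device_map.items())[:5]:
    print(name, "->", device)

# Count how many modules sit on each GPU. The forward pass visits them
# in order, so only one GPU computes at any given moment.
print(Counter(model.hf_device_map.values()))
```

With this layout, adding GPUs increases capacity, not speed; higher utilization for a single model copy comes from batching more sequences into each generate() call, so every GPU has work queued as activations flow through it.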

phalexo · Aug 18 '23, 19:08