litgpt
serving with multi-GPU
I was testing "litgpt serve" for llama-3-70b on 4x A100 80GB and got an OOM error. I tried the same command with llama-2-13b, and it looks like the "devices" argument only loads multiple replicas of the same model rather than distributing the memory across GPUs. Is there any way to do multi-GPU serving with the model?
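For reference, the commands I ran were roughly the following (reconstructed from memory, so the exact checkpoint paths and flag spellings may differ depending on the litgpt version):

```sh
# Download the weights and serve llama-3-70b on 4 GPUs -- this OOMs on 4x A100 80GB
litgpt download meta-llama/Meta-Llama-3-70B-Instruct
litgpt serve checkpoints/meta-llama/Meta-Llama-3-70B-Instruct --devices 4

# Same pattern with llama-2-13b: each device appears to hold a full replica
# of the model instead of a shard of it
litgpt serve checkpoints/meta-llama/Llama-2-13b-hf --devices 4
```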
Unfortunately, multi-GPU inference is not supported yet, but that's something on the roadmap.
There is a generate script that uses tensor parallelism for multi-GPU inference. You could adapt that one for serving.
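For anyone who wants to try that in the meantime, here is a minimal sketch of how the tensor-parallel script might be invoked; the module path and flags are assumptions on my part, so check the script in the repo for its actual arguments:

```sh
# Sketch only: the tensor-parallel generation script lives under
# litgpt/generate/tp.py; the entry point and flags below are assumptions.
python -m litgpt.generate.tp \
  --checkpoint_dir checkpoints/meta-llama/Meta-Llama-3-70B-Instruct \
  --prompt "What food do llamas eat?"
```

The tensor-parallel approach shards the weight matrices across the available GPUs, so each device holds only a fraction of the 70B parameters instead of a full copy, which is what avoids the OOM you are seeing with replica-based serving.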