litgpt
serving with multi-GPU
I was testing "litgpt serve" for llama-3-70b on 4x A100 80GB and got an OOM error. I tried the same command with llama-2-13b, and it looks like the "devices" argument only loads multiple replicas of the same model rather than distributing the memory across GPUs. Is there any way to do multi-GPU serving with the model?
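For reference, the commands I ran were roughly the following (reconstructed from memory, so the exact checkpoint paths and flag spellings may differ depending on the litgpt version):

```sh
# Download the weights and serve llama-3-70b on 4 GPUs -- this OOMs on 4x A100 80GB
litgpt download meta-llama/Meta-Llama-3-70B-Instruct
litgpt serve checkpoints/meta-llama/Meta-Llama-3-70B-Instruct --devices 4

# Same pattern with llama-2-13b: each device appears to hold a full replica
# of the model instead of a shard of it
litgpt serve checkpoints/meta-llama/Llama-2-13b-hf --devices 4
```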
Unfortunately, multi-GPU inference is not supported yet, but that's something on the roadmap.
There is a generate script that uses tensor parallelism for multi-GPU inference. You could adapt that one for serving.
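For anyone who wants to try that in the meantime, here is a minimal sketch of how the tensor-parallel script might be invoked; the module path and flags are assumptions on my part, so check the script in the repo for its actual arguments:

```sh
# Sketch only: the tensor-parallel generation script lives under
# litgpt/generate/tp.py; the entry point and flags below are assumptions.
python -m litgpt.generate.tp \
  --checkpoint_dir checkpoints/meta-llama/Meta-Llama-3-70B-Instruct \
  --prompt "What food do llamas eat?"
```

The tensor-parallel approach shards the weight matrices across the available GPUs, so each device holds only a fraction of the 70B parameters instead of a full copy, which is what avoids the OOM you are seeing with replica-based serving.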