
Splitting LLM layers across multiple GPUs


Now that CTranslate2 supports quantized 8-bit LLMs like OPT, are there any plans to add model parallelism to split a model's layers across multiple GPUs (or GPU + CPU) to meet the memory requirements for loading the model, as described here: https://huggingface.co/docs/transformers/v4.15.0/parallelism

JOHW85 · Jan 22, 2023

Yes, it would be great to implement tensor parallelism for large models.

Right now we support data parallelism on the GPU. We refer to it simply as "parallel execution" in the documentation.
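
For illustration, here is a minimal sketch of that data-parallel setup, where each listed GPU loads a full copy of the model and concurrently submitted batches are dispatched across the replicas. The model path, GPU indices, and tokens are placeholders.

```python
import ctranslate2

# One full model replica is loaded on each GPU listed in device_index.
translator = ctranslate2.Translator(
    "opt-1.3b-ct2",        # hypothetical path to a converted model
    device="cuda",
    device_index=[0, 1],   # data parallelism: one replica per GPU
)

# Submitting batches asynchronously lets the replicas run in parallel.
batches = [[["▁Hello", "▁world"]], [["▁How", "▁are", "▁you"]]]
futures = [translator.translate_batch(b, asynchronous=True) for b in batches]
results = [f[0].result() for f in futures]
print([r.hypotheses[0] for r in results])
```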

guillaumekln · Jan 23, 2023

I would very much appreciate it if tensor parallelism could be implemented. I tried Llama-2-13B on 2x RTX 3090 in FP16 and got an out-of-memory error, while 8-bit works fine on one GPU.
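
For reference, a minimal sketch of that single-GPU 8-bit setup, assuming the model was converted with the int8 quantization option; paths and generation parameters are placeholders.

```python
# Assumed conversion step:
#   ct2-transformers-converter --model meta-llama/Llama-2-13b-hf \
#       --quantization int8 --output_dir llama-2-13b-ct2-int8
import ctranslate2
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")

generator = ctranslate2.Generator(
    "llama-2-13b-ct2-int8",  # path from the conversion step above
    device="cuda",
    compute_type="int8",     # 8-bit weights fit on a single GPU
)

# Generate from a prompt; the Generator works on SentencePiece tokens.
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello, my name is"))
results = generator.generate_batch([tokens], max_length=64, sampling_topk=10)
print(tokenizer.decode(results[0].sequences_ids[0]))
```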

DHOFM · Sep 14, 2023

I just pushed PR #1599 to support tensor parallelism. This makes it possible to split a model across multiple GPUs. I tested this feature with a few models, such as Llama2 and a translation model. I would appreciate it if you could test it with other models or suggest important models to test.

I ran some tests with Llama2:

| Machines | GPUs | GPU type | Batch size | Perf (tokens/sec) | GPU memory | Quantization | Model |
|---|---|---|---|---|---|---|---|
| 1 | 1 | Tesla V100-PCIE-16GB | 1 | 46.9 | 7352 MB | Yes | Llama 7B |
| 1 | 2 | Tesla V100-PCIE-16GB | 1 | 51.5 | 3848 MB | Yes | Llama 7B |
| 2 | 4 | Tesla V100-PCIE-16GB | 1 | 17.8 | 2280 MB | Yes | Llama 7B |
| 1 | 1 | Tesla V100-PCIE-16GB | 5 | 185.3 | 7352 MB | Yes | Llama 7B |
| 1 | 2 | Tesla V100-PCIE-16GB | 5 | 176 | 3848 MB | Yes | Llama 7B |
| 2 | 4 | Tesla V100-PCIE-16GB | 5 | 62 | 2280 MB | Yes | Llama 7B |
| 1 | 1 | Tesla V100-PCIE-16GB | 1 | 43.3 | 13880 MB | No | Llama 7B |
| 1 | 2 | Tesla V100-PCIE-16GB | 1 | 66.5 | 7240 MB | No | Llama 7B |
| 2 | 4 | Tesla V100-PCIE-16GB | 1 | 31.9 | 4136 MB | No | Llama 7B |
| 1 | 1 | Tesla V100-PCIE-16GB | 5 | 179.3 | 13880 MB | No | Llama 7B |
| 1 | 2 | Tesla V100-PCIE-16GB | 5 | 249.5 | 7240 MB | No | Llama 7B |
| 2 | 4 | Tesla V100-PCIE-16GB | 5 | 101.7 | 4136 MB | No | Llama 7B |

If the GPUs are in the same machine, inference performs better. If the GPUs are spread across different machines, performance is lower because of network latency.
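
Roughly how the feature is used (a sketch: the model path is a placeholder, and you launch one process per GPU with MPI):

```python
# Save as run_tp.py and launch one process per GPU, e.g.:
#   mpirun -np 2 python3 run_tp.py
import ctranslate2

generator = ctranslate2.Generator(
    "llama-2-7b-ct2",      # hypothetical path to a converted Llama2 model
    device="cuda",
    tensor_parallel=True,  # each rank holds a slice of every layer
)

# "▁" tokens are SentencePiece pieces; a real script would use the tokenizer.
prompt = ["<s>", "▁The", "▁capital", "▁of", "▁France", "▁is"]
results = generator.generate_batch([prompt], max_length=32, sampling_topk=1)
print(results[0].sequences_ids[0])
```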

minhthuc2502 · Mar 1, 2024

Did you run 5 samples with batch_size = 1 sentence each, or did you run batch_size = 5 sentences?

vince62s · Mar 1, 2024

I updated the comment above for 2 cases: batch_size = 1 and batch_size = 5.

minhthuc2502 · Mar 2, 2024

I'll close this issue as the feature is now supported. If you have any problems, feel free to open a new issue.

minhthuc2502 · Mar 5, 2024