CTranslate2
Splitting LLM layers across multiple GPUs
As CTranslate2 now supports quantized 8-bit LLMs like OPT, are there any plans to add model parallelism to split a model's layers across multiple GPUs (or GPU + CPU) to meet the memory requirements for loading the model, as described here: https://huggingface.co/docs/transformers/v4.15.0/parallelism
Yes, it would be great to implement tensor parallelism for large models.
Right now we support data parallelism on the GPU. We refer to it simply as "parallel execution" in the documentation.
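For reference, this is roughly what the existing parallel execution looks like from the Python API: the full model is replicated on each listed GPU and incoming batches are dispatched between the replicas. The model path and prompt tokens below are placeholders.

```python
import ctranslate2

# Data parallelism ("parallel execution"): one full model replica per GPU.
generator = ctranslate2.Generator(
    "llama-2-7b-ct2",        # placeholder path to a converted model
    device="cuda",
    device_index=[0, 1],     # replicate the model on GPU 0 and GPU 1
)

# Batches are split across the replicas; each GPU still needs to hold
# the entire model, so this does not reduce the per-GPU memory footprint.
prompts = [["<s>", "▁Hello"], ["<s>", "▁World"]]
results = generator.generate_batch(prompts, max_length=32)
```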
I would very much appreciate it if tensor parallelism could be implemented. I tried Llama-2-13B on 2x RTX 3090 in fp16 and got OOM, while 8-bit works fine on a single GPU.
I just pushed PR #1599 to support tensor parallelism. It makes it possible to split a model across multiple GPUs. I tested the feature with a few models such as Llama 2 and a translation model. I would appreciate it if you could test it with other models or suggest which models should be prioritized for testing.
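For anyone who wants to try the feature, here is a rough usage sketch. I am assuming CTranslate2 is built with tensor parallel support and that the option is exposed as `tensor_parallel=True`; the model path is a placeholder, so adapt it to your setup.

```python
import ctranslate2

# Tensor parallelism: the weights are sharded across the participating GPUs,
# so each GPU only holds a fraction of the model.
generator = ctranslate2.Generator(
    "llama-2-7b-ct2",        # placeholder path to a converted model
    device="cuda",
    tensor_parallel=True,
)

results = generator.generate_batch([["<s>", "▁Hello"]], max_length=32)
```

The script is then launched with an MPI launcher, one process per GPU, e.g. something like `mpirun -np 2 python3 run_tp.py` on a single machine (with a hostfile listing the machines for the multi-node case).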
I ran some tests with Llama 2:
Machines | GPUs | GPU type | Batch size | Perf (tokens/sec) | GPU memory (per GPU) | 8-bit quantization | Model |
---|---|---|---|---|---|---|---|
1 | 1 | Tesla V100-PCIE-16GB | 1 | 46.9 | 7352MB | Yes | Llama 2 7B |
1 | 2 | Tesla V100-PCIE-16GB | 1 | 51.5 | 3848MB | Yes | Llama 2 7B |
2 | 4 | Tesla V100-PCIE-16GB | 1 | 17.8 | 2280MB | Yes | Llama 2 7B |
1 | 1 | Tesla V100-PCIE-16GB | 5 | 185.3 | 7352MB | Yes | Llama 2 7B |
1 | 2 | Tesla V100-PCIE-16GB | 5 | 176 | 3848MB | Yes | Llama 2 7B |
2 | 4 | Tesla V100-PCIE-16GB | 5 | 62 | 2280MB | Yes | Llama 2 7B |
1 | 1 | Tesla V100-PCIE-16GB | 1 | 43.3 | 13880MB | No | Llama 2 7B |
1 | 2 | Tesla V100-PCIE-16GB | 1 | 66.5 | 7240MB | No | Llama 2 7B |
2 | 4 | Tesla V100-PCIE-16GB | 1 | 31.9 | 4136MB | No | Llama 2 7B |
1 | 1 | Tesla V100-PCIE-16GB | 5 | 179.3 | 13880MB | No | Llama 2 7B |
1 | 2 | Tesla V100-PCIE-16GB | 5 | 249.5 | 7240MB | No | Llama 2 7B |
2 | 4 | Tesla V100-PCIE-16GB | 5 | 101.7 | 4136MB | No | Llama 2 7B |
If the GPUs are in the same machine, inference performance is better. If the GPUs are spread across different machines, performance drops because of network latency.
Did you run 5 samples with batch_size = 1, or did you run a single batch of 5 sentences (batch_size = 5)?
I updated the comment above to cover both cases: batch_size = 1 and batch_size = 5.
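To make the two settings concrete, here is a minimal sketch of what they roughly correspond to in the Python API (placeholder prompts and model path): batch_size here refers to how many prompts are sent together in one generate_batch call.

```python
import ctranslate2

# Assumed generator; placeholder model path.
generator = ctranslate2.Generator("llama-2-7b-ct2", device="cuda")

prompts = [["<s>", "▁Hello"]] * 5

# batch_size = 1: one prompt per generate_batch call, repeated 5 times.
for prompt in prompts:
    generator.generate_batch([prompt], max_length=128)

# batch_size = 5: all five prompts passed together in a single call.
generator.generate_batch(prompts, max_length=128)
```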
I'll close this issue as the feature is now supported. If you have any problems, feel free to open a new issue.