CTranslate2
Splitting LLM layers across multiple GPUs
As CTranslate2 now supports quantized 8-bit LLMs like OPT, are there any plans to add model parallelism to split a model's layers across multiple GPUs (or GPU + CPU) to meet the memory requirements for loading the model, as described here: https://huggingface.co/docs/transformers/v4.15.0/parallelism
Yes, it would be great to implement tensor parallelism for large models.
Right now we support data parallelism on the GPU. We refer to it simply as "parallel execution" in the documentation.
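For reference, this is roughly what the existing parallel execution looks like from the Python API: the full model is replicated on each listed GPU and incoming batches are dispatched between the replicas. The model path and prompt tokens below are placeholders.

```python
import ctranslate2

# Data parallelism ("parallel execution"): one full model replica per GPU.
generator = ctranslate2.Generator(
    "llama-2-7b-ct2",        # placeholder path to a converted model
    device="cuda",
    device_index=[0, 1],     # replicate the model on GPU 0 and GPU 1
)

# Batches are split across the replicas; each GPU still needs to hold
# the entire model, so this does not reduce the per-GPU memory footprint.
prompts = [["<s>", "▁Hello"], ["<s>", "▁World"]]
results = generator.generate_batch(prompts, max_length=32)
```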
I would very much appreciate it if tensor parallelism could be implemented. I tried Llama-2-13B on 2x RTX 3090 in fp16 and got OOM, while 8-bit works fine on a single GPU.
I just pushed PR #1599 to support tensor parallelism. It makes it possible to split a model across multiple GPUs. I tested the feature with a few models such as Llama 2 and a translation model. I would appreciate it if you could test it with other models or suggest which models should be prioritized for testing.
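For anyone who wants to try the feature, here is a rough usage sketch. I am assuming CTranslate2 is built with tensor parallel support and that the option is exposed as `tensor_parallel=True`; the model path is a placeholder, so adapt it to your setup.

```python
import ctranslate2

# Tensor parallelism: the weights are sharded across the participating GPUs,
# so each GPU only holds a fraction of the model.
generator = ctranslate2.Generator(
    "llama-2-7b-ct2",        # placeholder path to a converted model
    device="cuda",
    tensor_parallel=True,
)

results = generator.generate_batch([["<s>", "▁Hello"]], max_length=32)
```

The script is then launched with an MPI launcher, one process per GPU, e.g. something like `mpirun -np 2 python3 run_tp.py` on a single machine (with a hostfile listing the machines for the multi-node case).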
I ran some tests with Llama 2:
Machines | GPUs | GPU type | Batch size | Perf (tokens/sec) | GPU memory (per GPU) | 8-bit quantization | Model |
---|---|---|---|---|---|---|---|
1 | 1 | Tesla V100-PCIE-16GB | 1 | 46.9 | 7352MB | Yes | Llama 2 7B |
1 | 2 | Tesla V100-PCIE-16GB | 1 | 51.5 | 3848MB | Yes | Llama 2 7B |
2 | 4 | Tesla V100-PCIE-16GB | 1 | 17.8 | 2280MB | Yes | Llama 2 7B |
1 | 1 | Tesla V100-PCIE-16GB | 5 | 185.3 | 7352MB | Yes | Llama 2 7B |
1 | 2 | Tesla V100-PCIE-16GB | 5 | 176 | 3848MB | Yes | Llama 2 7B |
2 | 4 | Tesla V100-PCIE-16GB | 5 | 62 | 2280MB | Yes | Llama 2 7B |
1 | 1 | Tesla V100-PCIE-16GB | 1 | 43.3 | 13880MB | No | Llama 2 7B |
1 | 2 | Tesla V100-PCIE-16GB | 1 | 66.5 | 7240MB | No | Llama 2 7B |
2 | 4 | Tesla V100-PCIE-16GB | 1 | 31.9 | 4136MB | No | Llama 2 7B |
1 | 1 | Tesla V100-PCIE-16GB | 5 | 179.3 | 13880MB | No | Llama 2 7B |
1 | 2 | Tesla V100-PCIE-16GB | 5 | 249.5 | 7240MB | No | Llama 2 7B |
2 | 4 | Tesla V100-PCIE-16GB | 5 | 101.7 | 4136MB | No | Llama 2 7B |
If the GPUs are in the same machine, inference performance is better. If the GPUs are spread across different machines, performance drops because of network latency.
Did you run 5 samples with batch_size = 1, or did you run a single batch of 5 sentences (batch_size = 5)?
I updated the comment above to cover both cases: batch_size = 1 and batch_size = 5.
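To make the two settings concrete, here is a minimal sketch of what they roughly correspond to in the Python API (placeholder prompts and model path): batch_size here refers to how many prompts are sent together in one generate_batch call.

```python
import ctranslate2

# Assumed generator; placeholder model path.
generator = ctranslate2.Generator("llama-2-7b-ct2", device="cuda")

prompts = [["<s>", "▁Hello"]] * 5

# batch_size = 1: one prompt per generate_batch call, repeated 5 times.
for prompt in prompts:
    generator.generate_batch([prompt], max_length=128)

# batch_size = 5: all five prompts passed together in a single call.
generator.generate_batch(prompts, max_length=128)
```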
I'll close this issue as the feature is now supported. If you have any problems, feel free to open a new issue.