CTranslate2
Splitting LLM layers across multiple GPUs
As CTranslate2 now supports quantized 8-bit LLMs like OPT, are there any plans to add model parallelism, i.e. splitting a model's layers across multiple GPUs (or GPU + CPU), to meet the memory requirements for loading large models, as described here: https://huggingface.co/docs/transformers/v4.15.0/parallelism
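For context, the simplest form of what the linked document calls pipeline (layer-wise) parallelism is just a placement policy: assign consecutive layers to each accelerator until its memory budget is full, then spill the rest to the CPU. A toy sketch of such a policy, where all names, sizes, and the function itself are hypothetical illustrations rather than any CTranslate2 API:

```python
def partition_layers(layer_sizes, device_budgets):
    """Map each layer index to a device, filling devices in order.

    layer_sizes: per-layer memory cost (e.g. bytes), in model order
    device_budgets: dict of device name -> capacity, in priority order
    Layers that fit on no listed device fall back to "cpu".
    """
    placement = {}
    devices = list(device_budgets.items())
    d = 0       # index of the device currently being filled
    used = 0    # memory already assigned to that device
    for i, size in enumerate(layer_sizes):
        # Move on to the next device once the current one is full.
        while d < len(devices) and used + size > devices[d][1]:
            d += 1
            used = 0
        if d < len(devices):
            placement[i] = devices[d][0]
            used += size
        else:
            placement[i] = "cpu"  # spill remaining layers to host memory
    return placement

# Example: 6 equal-size layers, two GPUs that each hold 2 layers;
# the overflow lands on the CPU.
print(partition_layers([1] * 6, {"cuda:0": 2, "cuda:1": 2}))
# → {0: 'cuda:0', 1: 'cuda:0', 2: 'cuda:1', 3: 'cuda:1', 4: 'cpu', 5: 'cpu'}
```

At inference time, activations would then be transferred between devices at each partition boundary, which is the main cost such a feature has to manage.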