
Splitting LLM layers across multiple GPUs

Open JOHW85 opened this issue 1 year ago • 3 comments

Now that CTranslate2 supports quantized 8-bit LLMs such as OPT, are there any plans to add model parallelism, i.e. splitting a model's layers across multiple GPUs (or GPU+CPU) to meet the memory requirements for loading the model, as described here: https://huggingface.co/docs/transformers/v4.15.0/parallelism
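For context, the linked Hugging Face page describes (among other schemes) naive pipeline parallelism, where contiguous blocks of layers are placed on different devices. This is not a CTranslate2 feature; as a rough illustration only, the layer-to-device assignment step could be sketched like this (all names are hypothetical):

```python
def partition_layers(num_layers, devices):
    """Assign layer indices to devices in contiguous, near-equal chunks,
    so that each device holds roughly the same number of layers."""
    assignment = {}
    base, extra = divmod(num_layers, len(devices))
    start = 0
    for i, dev in enumerate(devices):
        # spread the remainder over the first `extra` devices
        count = base + (1 if i < extra else 0)
        for layer in range(start, start + count):
            assignment[layer] = dev
        start += count
    return assignment

# e.g. 10 transformer layers over two GPUs and the CPU
mapping = partition_layers(10, ["cuda:0", "cuda:1", "cpu"])
```

At inference time, activations would then be moved between devices at each chunk boundary, which is what makes this scheme "naive": only one device is busy at a time.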

JOHW85 avatar Jan 22 '23 18:01 JOHW85