CTranslate2
Distributed mode
Is there, or is there a plan for, a way to run inference in distributed mode across multiple machines?
Can you specify what you mean exactly? Do you mean splitting the model on multiple machines (model/tensor parallelism), or loading the same model on multiple machines (data parallelism)?
I mean model parallelism, i.e. when the model is too big to fit in a single machine's VRAM.
Yes, I also want to know whether this is possible. Can we have something similar to what device_map="auto" does?
As far as I know, device_map="auto" will not load a model on multiple machines. To load the model on multiple GPUs (on the same machine), see the existing issue #1052.
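For illustration, a minimal sketch of what device_map="auto" does in Hugging Face Transformers: it shards one model across the GPUs of a single machine, not across machines. The checkpoint name is a placeholder, and the `accelerate` package must be installed for device mapping to work.

```python
from transformers import AutoModelForCausalLM

# device_map="auto" distributes layers over the available local devices
# (GPUs, then CPU/disk offload if needed); it never spans machines.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",  # placeholder checkpoint
    device_map="auto",
)
```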
I'm closing this issue as completed. Tensor parallelism is now supported in CTranslate2.
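For reference, a minimal sketch of enabling tensor parallelism in the Python API, assuming a model already converted to the CTranslate2 format (the model path and input tokens are placeholders); the script must be launched through MPI, e.g. `mpirun -np 2 python run.py`, so that each rank holds a shard of the weights:

```python
import ctranslate2

# tensor_parallel=True splits the model weights across the GPUs
# of the participating MPI ranks instead of replicating them.
translator = ctranslate2.Translator(
    "ende_ctranslate2/",  # placeholder path to a converted model
    device="cuda",
    tensor_parallel=True,
)

# All ranks take part in the computation; the tokens are placeholders.
results = translator.translate_batch([["▁Hello", "▁world", "!"]])
print(results[0].hypotheses[0])
```

See the CTranslate2 documentation on parallel execution for the supported launch configurations.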