candle
How to run inference of a (very) large model across multiple GPUs?
It is mentioned in the README that candle supports multi-GPU inference, using NCCL under the hood. How can this be implemented? Is there an available example to look at?
Also, I know PyTorch has techniques like DDP and FSDP; is candle's multi-GPU inference support comparable to these?
Please see the llama multiprocess example. Multi-GPU inference is implemented with parallelized linear layers:
https://github.com/huggingface/candle/blob/f48c07e2428a6d777ffdea57a2d1ac6a7d58a8ee/candle-examples/examples/llama_multiprocess/model.rs#L293-L308
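For reference, here is a minimal sketch of the idea behind those layers (simplified, not the exact code from the example): each rank holds a shard of the weight matrix, computes a partial matmul, and the partial results are summed across ranks with an NCCL all-reduce. The `all_reduce_sum` helper below is hypothetical; the actual example wraps cudarc's `comm.all_reduce` in a candle custom op.

```rust
use std::rc::Rc;

use candle::{Result, Tensor};
use cudarc::nccl::safe::Comm;

/// Row-parallel linear layer: the weight is split along the input dimension,
/// so each rank multiplies its activation shard by its weight shard and the
/// partial outputs are summed across all ranks.
struct TensorParallelRowLinear {
    weight: Tensor, // shard of shape (out_dim, in_dim / world_size)
    comm: Rc<Comm>,
}

impl TensorParallelRowLinear {
    fn forward(&self, x: &Tensor) -> Result<Tensor> {
        // x holds this rank's activation shard: (.., in_dim / world_size).
        let partial = x.matmul(&self.weight.t()?)?;
        // Sum the partial products from all ranks so each rank ends up
        // with the full output.
        all_reduce_sum(&partial, &self.comm)
    }
}

/// Hypothetical helper: the real example implements this as a candle custom
/// op that launches ncclAllReduce on the tensor's CUDA buffer via cudarc.
fn all_reduce_sum(_x: &Tensor, _comm: &Rc<Comm>) -> Result<Tensor> {
    todo!("wrap cudarc's comm.all_reduce in a candle custom op")
}
```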
That example is for a single node. What about multiple nodes? Can we just run the example with mpirun -n 2 --hostfile ../../hostfile target/release/llama_multiprocess 2 2000?
Update:
I guess I must modify the code to handle the MPI world rank. I think sticking to NCCL as the backend might be better, but then does cudarc support cross-node communication?
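For what it's worth, the standard pattern for bootstrapping NCCL across nodes is to create the NCCL unique id on rank 0, distribute it to the other ranks over MPI, and build the communicator from it on every rank. Below is a rough sketch, assuming the rsmpi (`mpi`) crate and the same cudarc `Id`/`Comm::from_rank` API the single-node example uses (exact names may differ by version; `gpus_per_node` is an assumption for the rank-to-GPU mapping):

```rust
use std::ffi::c_char;

use cudarc::driver::CudaDevice;
use cudarc::nccl::safe::{Comm, Id};
use mpi::traits::*;

fn main() {
    // One MPI process per GPU, across however many nodes mpirun launches on.
    let universe = mpi::initialize().unwrap();
    let world = universe.world();
    let rank = world.rank() as usize;
    let world_size = world.size() as usize;

    // Rank 0 creates the NCCL unique id; everyone else receives it over MPI.
    let mut id_bytes: [c_char; 128] = if rank == 0 {
        *Id::new().unwrap().internal()
    } else {
        [0; 128]
    };
    world.process_at_rank(0).broadcast_into(&mut id_bytes[..]);
    let id = Id::uninit(id_bytes);

    // Assumed layout: ranks are packed one per GPU on each node.
    let gpus_per_node = 8;
    let device = CudaDevice::new(rank % gpus_per_node).unwrap();
    let comm = Comm::from_rank(device, rank, world_size, id).unwrap();

    // `comm` can now back the parallelized layers from the example.
    let _ = comm;
}
```

Launched via mpirun with a hostfile as above, this would give each process a cross-node NCCL communicator instead of the single-node bootstrap the example does today.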
Found this library https://github.com/oddity-ai/async-cuda
I started a draft here for splitting a model across multiple GPUs on different nodes. There is a mapping feature on mistral.rs, as I linked above.
repo
- https://github.com/huggingface/candle/issues/1936
I have the same question.