
How to run inference of a (very) large model across multiple GPUs?

jorgeantonio21 opened this issue 10 months ago • 4 comments

It is mentioned in the README that candle supports multi-GPU inference, using NCCL under the hood. How can this be implemented? Is there an available example to look at?

Also, I know PyTorch has techniques like DDP and FSDP; is candle's multi-GPU inference support comparable to these?

jorgeantonio21 avatar Apr 04 '24 13:04 jorgeantonio21

Please see the llama multiprocess example. Multi-GPU inference there is implemented by parallelizing the linear layers:

https://github.com/huggingface/candle/blob/f48c07e2428a6d777ffdea57a2d1ac6a7d58a8ee/candle-examples/examples/llama_multiprocess/model.rs#L293-L308
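To make the idea concrete, here is a minimal single-process sketch of the row-parallel pattern those layers use, with hypothetical shapes and the NCCL all-reduce simulated by an in-process sum; in the real example each shard lives in its own process on its own GPU and the partials are combined with `ReduceOp::Sum`:

```rust
// Single-process illustration of row tensor parallelism (hypothetical
// shapes). In llama_multiprocess the two "ranks" are separate processes
// on separate GPUs and the final sum is an NCCL all-reduce.
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let x = Tensor::randn(0f32, 1.0, (1, 8), &dev)?; // activations
    let w = Tensor::randn(0f32, 1.0, (8, 4), &dev)?; // full weight

    // Shard the contraction dimension across two pretend ranks.
    let (x0, x1) = (x.narrow(1, 0, 4)?, x.narrow(1, 4, 4)?);
    let (w0, w1) = (w.narrow(0, 0, 4)?, w.narrow(0, 4, 4)?);

    // Each rank computes a partial product; the sum stands in for
    // NCCL's ReduceOp::Sum all-reduce across GPUs.
    let reduced = (x0.matmul(&w0)? + x1.matmul(&w1)?)?;

    // Matches the unsharded matmul up to float rounding.
    let diff = (reduced - x.matmul(&w)?)?.abs()?.sum_all()?.to_scalar::<f32>()?;
    println!("sum abs diff: {diff}");
    Ok(())
}
```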

EricLBuehler avatar Apr 04 '24 14:04 EricLBuehler

> Please see the llama multiprocess example. Multi-GPU inference there is implemented by parallelizing the linear layers:
>
> https://github.com/huggingface/candle/blob/f48c07e2428a6d777ffdea57a2d1ac6a7d58a8ee/candle-examples/examples/llama_multiprocess/model.rs#L293-L308

That example is for a single node. How about multiple nodes? Can we just run the example with `mpirun -n 2 --hostfile ../../hostfile target/release/llama_multiprocess 2 2000`?

Update:

I guess I must modify the code to support the MPI world rank. I think sticking to NCCL as a backend might be better, but then does cudarc support cross-node communication?
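For what it's worth, cudarc does expose NCCL's unique-id bootstrap, which is the piece cross-node communication needs: rank 0 creates an `Id`, and every rank (on any node) builds its `Comm` from that same id. A hedged sketch, assuming cudarc's safe NCCL bindings (the `nccl` feature), the `OMPI_COMM_WORLD_*` environment variables that `mpirun` sets, `anyhow` for error handling, and a hypothetical shared-filesystem path for distributing the id in place of an MPI broadcast:

```rust
// Hedged sketch of a cross-node NCCL bootstrap with cudarc. Rank and
// world size come from env vars that Open MPI's mpirun sets; the NCCL
// unique id is shared through a file on a shared filesystem here
// (hypothetical path), though an MPI broadcast is the usual approach.
use cudarc::driver::CudaDevice;
use cudarc::nccl::safe::{Comm, Id};

fn main() -> anyhow::Result<()> {
    let rank: usize = std::env::var("OMPI_COMM_WORLD_RANK")?.parse()?;
    let world_size: usize = std::env::var("OMPI_COMM_WORLD_SIZE")?.parse()?;
    let local_rank: usize = std::env::var("OMPI_COMM_WORLD_LOCAL_RANK")?.parse()?;

    let id = if rank == 0 {
        // Rank 0 creates the unique id and publishes it.
        let id = Id::new().unwrap();
        let bytes: Vec<u8> = id.internal().iter().map(|&c| c as u8).collect();
        std::fs::write("/shared/nccl_id", bytes)?;
        id
    } else {
        // Other ranks (possibly on other nodes) wait for it to appear.
        let bytes = loop {
            match std::fs::read("/shared/nccl_id") {
                Ok(b) if b.len() == 128 => break b,
                _ => std::thread::sleep(std::time::Duration::from_millis(100)),
            }
        };
        let mut internal = [0 as core::ffi::c_char; 128];
        for (dst, &src) in internal.iter_mut().zip(bytes.iter()) {
            *dst = src as core::ffi::c_char;
        }
        Id::uninit(internal)
    };

    // One process per GPU: pick the device by node-local rank.
    let device = CudaDevice::new(local_rank)?;
    let comm = Comm::from_rank(device, rank, world_size, id).unwrap();
    println!("rank {rank}/{world_size} ready on GPU {local_rank} (comm rank {})", comm.rank());
    Ok(())
}
```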

Found this library https://github.com/oddity-ai/async-cuda

b0xtch avatar Apr 12 '24 20:04 b0xtch

I started a draft here for splitting a model across multiple GPUs on different nodes. There is a mapping feature in the mistral.rs repo, as I linked above (a sketch of the idea follows the issue link below):

  • https://github.com/huggingface/candle/issues/1936
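Until that draft lands, the mapping idea can be sketched as a plain pipeline split: assign layers to devices and move activations at stage boundaries. A minimal hypothetical sketch (single node; across nodes the hand-off would be an NCCL send/recv or an RPC rather than `to_device`, and `MappedLayer` is a placeholder name, not a candle API):

```rust
// Hypothetical pipeline-style device map: each layer owns a device and
// activations hop devices at stage boundaries. Across nodes the hop
// would be NCCL send/recv or RPC instead of Tensor::to_device.
use candle_core::{Device, Result, Tensor};

// Placeholder layer: real code would hold this layer's weights,
// already resident on `device`.
struct MappedLayer {
    device: Device,
}

impl MappedLayer {
    fn forward(&self, x: &Tensor) -> Result<Tensor> {
        // Bring activations to this layer's device, then compute.
        let x = x.to_device(&self.device)?;
        // Identity stands in for the actual layer computation.
        Ok(x)
    }
}

fn run(layers: &[MappedLayer], input: &Tensor) -> Result<Tensor> {
    let mut x = input.clone();
    for layer in layers {
        x = layer.forward(&x)?;
    }
    Ok(x)
}

fn main() -> Result<()> {
    let layers = vec![
        MappedLayer { device: Device::Cpu },
        MappedLayer { device: Device::Cpu }, // e.g. Device::new_cuda(1)? on a GPU box
    ];
    let x = Tensor::zeros((1, 16), candle_core::DType::F32, &Device::Cpu)?;
    let y = run(&layers, &x)?;
    println!("output shape: {:?}", y.shape());
    Ok(())
}
```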

b0xtch avatar Jun 28 '24 19:06 b0xtch

I have the same question.

deema-A avatar Aug 12 '24 04:08 deema-A