Adapting llama_multiprocess to use `rsmpi`
Has anyone considered adapting llama_multiprocess to run across multiple machines instead of multiple processes on a single machine? I've started by using the SystemCommunicator from the rsmpi library to replace nccl::Comm, but debugging has been pretty opaque: I managed to get it running, but the last step hung forever. I can share more details.
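For reference, here is a minimal sketch of the kind of swap being described, assuming the standard rsmpi (`mpi` crate) collectives API; the buffer is just a stand-in for one rank's tensor shard, not code from the actual example:

```rust
use mpi::collective::SystemOperation;
use mpi::traits::*;

fn main() {
    // rsmpi's SystemCommunicator plays the role nccl::Comm plays in the
    // single-node example: one rank per process/machine.
    let universe = mpi::initialize().expect("MPI init failed");
    let world = universe.world();
    let rank = world.rank();
    let size = world.size();

    // Stand-in for a sharded activation/weight buffer on this rank.
    let local: Vec<f32> = vec![rank as f32; 8];
    let mut summed = vec![0.0f32; 8];

    // Rough equivalent of an NCCL all-reduce with sum: every rank ends up
    // with the element-wise sum across all ranks.
    world.all_reduce_into(&local[..], &mut summed[..], SystemOperation::sum());

    println!("rank {rank}/{size}: {:?}", summed);
}
```

Launched with something like `mpirun -n 2 --host nodeA,nodeB ...`, each machine runs one rank and the all-reduce goes over the network instead of NVLink/PCIe.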
Oh nice, and why not use NCCL for multi-node?
I haven't checked, but the bindings to NCCL are pretty agnostic, so it should be easy to set up.
I know NCCL is not the solution to everything, so it would still be very interesting to see what rsmpi could bring!
Oh! Can I use NCCL for cross-node communication? I was leaning on rsmpi for the cross-node layer, with NCCL handling the actual compute.
Here is some really rough code: https://github.com/b0xtch/octo/tree/main/src
Can i use nccl for cross node communication?
It's one of the big selling points! https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/ I'm not a big user of multi-node setups, but it should work pretty much out of the box. (There's a lot of configuration that goes into tuning for performance.)
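For anyone curious, here is a rough sketch of what the cross-node NCCL bootstrap could look like with the cudarc bindings candle already uses, with rsmpi only shipping the unique NCCL id between machines. The types used here (`Id`, `Comm::from_rank`, `CudaDevice`) are my reading of cudarc's NCCL module, so treat this as an assumption rather than the example's actual code:

```rust
use std::os::raw::c_char;

use cudarc::driver::CudaDevice;
use cudarc::nccl::safe::{Comm, Id};
use mpi::traits::*;

fn main() {
    let universe = mpi::initialize().expect("MPI init failed");
    let world = universe.world();
    let rank = world.rank() as usize;
    let world_size = world.size() as usize;

    // NCCL needs the same 128-byte unique id on every rank, but it does not
    // transport that id itself; MPI (or a shared file / TCP store) has to.
    let mut id_bytes = [0 as c_char; 128];
    if rank == 0 {
        id_bytes = *Id::new().unwrap().internal();
    }
    world.process_at_rank(0).broadcast_into(&mut id_bytes[..]);
    let id = Id::uninit(id_bytes);

    // Assumes one GPU per rank; multi-GPU nodes would pick an ordinal such
    // as rank % gpus_per_node instead of 0.
    let device = CudaDevice::new(0).unwrap();
    let comm = Comm::from_rank(device, rank, world_size, id).unwrap();
    println!("rank {rank}/{world_size} joined the NCCL communicator");

    // From here, all-reduce / broadcast calls on `comm` work the same way
    // as in the single-node llama_multiprocess path.
    drop(comm);
}
```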
This is great, thanks. I’ll update here after I’ve tried a few things.
@b0xtch Did you manage to get llama_multiprocess running on a multi-node setup with NCCL?
I started it a while back, but I've been blocked by the flash-attention problem on CUDA > 12.x and haven't been able to finish testing it. Once I fix the attention problem, I'll get back to this.
I created a draft PR for now: https://github.com/huggingface/candle/pull/2292