
Adapting llama_multiprocess to use `rsmpi`

Open • b0xtch opened this issue 2 years ago • 6 comments

Has anyone considered adapting llama_multiprocess to run on multiple machines instead of multiple processes? I've started by using the SystemCommunicator from the rsmpi library to replace nccl::Comm, but debugging has been pretty opaque. I got it running, but the last step hangs forever. I can share more details.
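
For reference, the shape of the swap I'm attempting, as a minimal sketch (illustrative buffers, not the real candle integration):

```rust
// Minimal sketch: SystemCommunicator's collectives play the role that
// nccl::Comm's do in llama_multiprocess. Buffer shapes are illustrative.
use mpi::collective::SystemOperation;
use mpi::traits::*;

fn main() {
    let universe = mpi::initialize().unwrap();
    let world = universe.world(); // the SystemCommunicator
    let rank = world.rank();

    // Each rank contributes a shard; all_reduce_into sums across ranks,
    // which is the collective that tensor-parallel layers lean on.
    let local = vec![rank as f32; 4];
    let mut summed = vec![0.0f32; 4];
    world.all_reduce_into(&local[..], &mut summed[..], SystemOperation::sum());

    println!("rank {rank} sees {summed:?}");
}
```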

b0xtch avatar Sep 20 '23 00:09 b0xtch

Oh nice! And why not use NCCL for multi-node?

I haven't checked, but the bindings to NCCL are pretty agnostic, so it should be easy to set up. I know NCCL is not the solution to everything, so it would still be very interesting to see what rsmpi could bring!
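
Roughly, the only multi-node-specific part is getting the same unique id to every rank; something like this (a sketch against the cudarc NCCL bindings that candle's example builds on; exact signatures may vary between versions):

```rust
// Sketch against cudarc's NCCL bindings; signatures may vary by version.
use cudarc::driver::CudaDevice;
use cudarc::nccl::{Comm, Id};

fn init_nccl(rank: usize, world_size: usize, id: Id) -> Comm {
    // `id` must be identical on every rank: rank 0 creates it with
    // Id::new() and the other ranks receive its bytes out of band
    // (a shared file, TCP, or MPI). That handoff is the only part
    // that differs between single-node and multi-node.
    let device = CudaDevice::new(rank).unwrap(); // assumes one GPU per local rank
    Comm::from_rank(device, rank, world_size, id).unwrap()
}
```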

Narsil avatar Sep 20 '23 15:09 Narsil

> Oh nice! And why not use NCCL for multi-node?
>
> I haven't checked, but the bindings to NCCL are pretty agnostic, so it should be easy to set up. I know NCCL is not the solution to everything, so it would still be very interesting to see what rsmpi could bring!

Oh! Can I use NCCL for cross-node communication? I was leaning on rsmpi to serve as the cross-node layer, with NCCL handling the actual compute.

Here is some really rough code: https://github.com/b0xtch/octo/tree/main/src
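
The hybrid I have in mind looks roughly like this (a sketch; it assumes cudarc's Id exposes its raw bytes via internal() and can be rebuilt with uninit(), which may differ by version):

```rust
use cudarc::driver::CudaDevice;
use cudarc::nccl::{Comm, Id};
use mpi::traits::*;

fn main() {
    // rsmpi only bootstraps: it broadcasts the NCCL unique id.
    let universe = mpi::initialize().unwrap();
    let world = universe.world();
    let rank = world.rank() as usize;
    let world_size = world.size() as usize;

    // Rank 0 creates the id; MPI broadcasts its 128 raw bytes.
    let mut bytes = if rank == 0 {
        *Id::new().unwrap().internal() // assumed accessor for the raw id bytes
    } else {
        [0i8; 128]
    };
    world.process_at_rank(0).broadcast_into(&mut bytes[..]);
    let id = Id::uninit(bytes); // assumed constructor from raw bytes

    // From here MPI is out of the picture: the actual compute
    // collectives run over NCCL, which picks its own transports.
    let device = CudaDevice::new(0).unwrap(); // local GPU ordinal, illustrative
    let _comm = Comm::from_rank(device, rank, world_size, id).unwrap();
}
```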

b0xtch avatar Sep 20 '23 16:09 b0xtch

> Can I use NCCL for cross-node communication?

It's one of the big selling points! https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/ I'm not a heavy multi-node user myself, but it should work pretty much out of the box. (There is a lot of configuration that goes into tuning it for performance.)
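
For what it's worth, the first knobs people usually reach for are environment variables NCCL reads at communicator init; a minimal sketch of wiring them up (values are illustrative, not recommendations):

```rust
// NCCL reads these env vars when the communicator is created, so set
// them before Comm::from_rank. Values here are illustrative.
fn tune_nccl() {
    std::env::set_var("NCCL_DEBUG", "INFO");         // verbose init logs; invaluable for hangs
    std::env::set_var("NCCL_SOCKET_IFNAME", "eth0"); // pin the bootstrap network interface
    std::env::set_var("NCCL_IB_DISABLE", "0");       // keep InfiniBand enabled when present
}
```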

Narsil avatar Sep 20 '23 16:09 Narsil

> Can I use NCCL for cross-node communication?
>
> It's one of the big selling points! https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/ I'm not a heavy multi-node user myself, but it should work pretty much out of the box. (There is a lot of configuration that goes into tuning it for performance.)

This is great, thanks. I'll update here after I've tried a few things.

b0xtch avatar Sep 20 '23 17:09 b0xtch

@b0xtch Did you manage to get llama_multiprocess running on a multi-node setup with NCCL?

Garbaz avatar Jun 22 '24 16:06 Garbaz

> @b0xtch Did you manage to get llama_multiprocess running on a multi-node setup with NCCL?

I started on it a while back, but I've been blocked by the flash-attention problem on CUDA > 12.x, so I haven't been able to finish testing it. Once the attention problem is fixed, I'll get back to this.

I created a draft PR for now: https://github.com/huggingface/candle/pull/2292

b0xtch avatar Jun 28 '24 04:06 b0xtch