Mahmoud Shehata
This can be split up into different issues, especially the ones that aren't currently actionable
> There are two different ways to do multi-GPU: multi-device and multi-host. We'll need to do both to truly reach LLM-scale training, but we should start with multi-device. > >...
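For concreteness, here is a minimal, dependency-free sketch of the data-parallel pattern both variants boil down to: every rank computes gradients on its own data shard, then an all-reduce averages them so the replicas stay in sync. Ranks are simulated with threads and a shared buffer; in a real setup each rank is one GPU (multi-device) or a GPU on another machine (multi-host), and the reduction is done by an NCCL collective rather than by hand.

```rust
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

fn main() {
    let world_size: usize = 4;
    // Shared accumulator standing in for the collective's reduction buffer.
    let sum = Arc::new(Mutex::new(vec![0f32; 3]));
    let barrier = Arc::new(Barrier::new(world_size));

    let handles: Vec<_> = (0..world_size)
        .map(|rank| {
            let sum = Arc::clone(&sum);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                // Each rank "computes" a local gradient on its own shard of data.
                let local_grad = vec![rank as f32 + 1.0; 3];
                // Reduce phase: add the local gradient into the shared buffer.
                {
                    let mut s = sum.lock().unwrap();
                    for (acc, g) in s.iter_mut().zip(&local_grad) {
                        *acc += *g;
                    }
                }
                // Wait until every rank has contributed (the "all" in all-reduce).
                barrier.wait();
                // Broadcast phase: every rank reads back the averaged gradient.
                let avg: Vec<f32> = sum
                    .lock()
                    .unwrap()
                    .iter()
                    .map(|v| v / world_size as f32)
                    .collect();
                println!("rank {rank} sees averaged gradient {avg:?}");
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```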
> jorgeantonio21 Determinism is essential. For the project I am working on, cross-platform consistency and verifiable communication (moot) across nodes are paramount. Based on this paper, [Agatha](https://dl.acm.org/doi/10.1145/3340531.3412684), here are some conclusions...
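To ground the verification point, here is a minimal sketch of one way ranks could check bit-exact agreement after a collective: hash the tensor's raw bytes and compare the digests out of band. It assumes the `candle-core` crate; the FNV-1a hash and the function name are illustrative, not what Agatha or candle actually use. Note this only detects divergence (e.g., from non-associative floating-point reductions done in different orders); it does not make the reduction itself deterministic.

```rust
use candle_core::{Device, Result, Tensor};

/// FNV-1a over the f32 contents of a tensor, in row-major order.
fn tensor_fingerprint(t: &Tensor) -> Result<u64> {
    let values: Vec<f32> = t.flatten_all()?.to_vec1::<f32>()?;
    let mut hash: u64 = 0xcbf29ce484222325;
    for v in values {
        for b in v.to_le_bytes() {
            hash ^= b as u64;
            hash = hash.wrapping_mul(0x100000001b3);
        }
    }
    Ok(hash)
}

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // Stand-in for the tensor a rank holds after an all-reduce.
    let t = Tensor::arange(0f32, 12f32, &dev)?.reshape((3, 4))?;
    println!("rank-local fingerprint: {:016x}", tensor_fingerprint(&t)?);
    // Each rank would send this 8-byte digest to its peers (or a verifier)
    // and flag a mismatch instead of shipping whole tensors around.
    Ok(())
}
```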
> > * I am blocked on `flash_attn` on CUDA > `12.X`, so I haven't been able to finish this > > Are you sure it's blocked on these versions?...
> Oh nice, and why not use NCCL for multi-node? > > I haven't checked, but the bindings to NCCL are pretty agnostic; it should be easy to...
> > Can I use NCCL for cross-node communication? > > It's one of the big selling points! https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/ I'm not a big user of multi-node setups, but...
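For reference, the bootstrap NCCL expects for cross-node communication looks roughly like the sketch below. The `NcclId`/`NcclComm` wrappers are hypothetical stand-ins (not candle's or cudarc's actual API), but the handshake shape is NCCL's: rank 0 creates a unique id (`ncclGetUniqueId`), the id is shared out of band, and every rank calls `ncclCommInitRank` with the same id, its rank, and the world size. Single-node vs multi-node only changes how the id gets shared and which network transport NCCL picks; the calling code is identical.

```rust
use std::env;

/// Hypothetical stand-in for NCCL's 128-byte unique id.
struct NcclId([u8; 128]);

/// Hypothetical stand-in for an initialized NCCL communicator.
struct NcclComm {
    rank: usize,
    world_size: usize,
}

impl NcclComm {
    /// Would wrap ncclCommInitRank in a real binding.
    fn init(_id: &NcclId, rank: usize, world_size: usize) -> Self {
        NcclComm { rank, world_size }
    }
}

fn main() {
    // Typical launcher contract: rank and world size arrive via env vars.
    let rank: usize = env::var("RANK").unwrap_or_else(|_| "0".into()).parse().unwrap();
    let world_size: usize = env::var("WORLD_SIZE").unwrap_or_else(|_| "1".into()).parse().unwrap();

    // Rank 0 would create the id and publish it (shared filesystem, TCP store,
    // MPI broadcast, ...); the other ranks would read it from the same place.
    let id = NcclId([0u8; 128]);

    let comm = NcclComm::init(&id, rank, world_size);
    println!("rank {}/{} ready for collectives", comm.rank, comm.world_size);
    // From here each process binds one GPU (conventionally rank % gpus_per_node)
    // and issues all-reduce / all-gather calls on the communicator.
}
```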
> @b0xtch Did you manage to get llama_multiprocess running on a multi-node setup with NCCL? I started it a while back, but I have been blocked by the flash...
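On the launching side, the usual pattern is one OS process per GPU, each handed its rank and the world size through environment variables so the bootstrap above can pick them up; across machines the same launcher runs on every node (ssh, SLURM, mpirun) with a different node rank. The sketch below is generic and hypothetical: the worker binary name and env-var names are made up, and this is not llama_multiprocess's actual CLI.

```rust
use std::env;
use std::process::Command;

fn main() {
    let gpus_per_node = 2; // assumption for the example
    let node_rank: usize = env::var("NODE_RANK").unwrap_or_else(|_| "0".into()).parse().unwrap();
    let num_nodes: usize = env::var("NUM_NODES").unwrap_or_else(|_| "1".into()).parse().unwrap();
    let world_size = gpus_per_node * num_nodes;

    let mut children = Vec::new();
    for local_rank in 0..gpus_per_node {
        let global_rank = node_rank * gpus_per_node + local_rank;
        let child = Command::new("./target/release/train_worker") // hypothetical worker binary
            .env("RANK", global_rank.to_string())
            .env("WORLD_SIZE", world_size.to_string())
            // Pin each worker to one device so the rank <-> GPU mapping is unambiguous.
            .env("CUDA_VISIBLE_DEVICES", local_rank.to_string())
            .spawn()
            .expect("failed to spawn worker");
        children.push(child);
    }
    for mut child in children {
        let status = child.wait().expect("worker did not exit cleanly");
        assert!(status.success(), "a worker exited with an error");
    }
}
```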
Related issue: https://github.com/huggingface/candle/issues/2007
There was an attempt to do tensor parallelism: https://github.com/EricLBuehler/mistral.rs/pull/72
Amazing stuff! The tensor parallelism, I am guessing, will be in the core candle repo? Or do you plan to abstract it in some way under this repo? I have...
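As a concrete reference point for the tensor-parallelism discussion, here is a minimal column-parallel linear sketch written against `candle-core`. Two devices are simulated with `Device::Cpu` so it runs anywhere; a real version would put the shards on `Device::new_cuda(0)` / `Device::new_cuda(1)` and replace the `to_device` + `cat` gather with an NCCL all-gather. This is an illustration, not how the mistral.rs PR implements it.

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev0 = Device::Cpu; // stand-in for GPU 0
    let dev1 = Device::Cpu; // stand-in for GPU 1

    let (batch, d_in, d_out) = (2, 8, 6);
    let x = Tensor::randn(0f32, 1f32, (batch, d_in), &dev0)?;
    let w = Tensor::randn(0f32, 1f32, (d_in, d_out), &dev0)?;

    // Shard the weight by output columns: each device owns half the columns.
    let w0 = w.narrow(1, 0, d_out / 2)?.to_device(&dev0)?;
    let w1 = w.narrow(1, d_out / 2, d_out / 2)?.to_device(&dev1)?;

    // Each device multiplies the (replicated) activations by its own shard.
    let y0 = x.matmul(&w0)?;
    let y1 = x.to_device(&dev1)?.matmul(&w1)?;

    // Gather the partial outputs back on one device and stitch them together.
    let y_sharded = Tensor::cat(&[&y0, &y1.to_device(&dev0)?], 1)?;

    // Check against the unsharded computation.
    let y_full = x.matmul(&w)?;
    let err = (&y_full - &y_sharded)?.abs()?.sum_all()?.to_scalar::<f32>()?;
    println!("sharded output shape {:?}, total abs error {err:e}", y_sharded.dims());
    Ok(())
}
```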
Yeah, I noticed the same with the Groq Mistral model as well.