Support MPI distributed training
I have this in mind for the Mojo target issue, which is really about having the Makefile support composability like the one for llama.cpp. We could probably copy-pasta most of what llama.cpp has so the build uses mpicc. We would still need to write the MPI code.
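For reference, a rough sketch of what that Makefile change might look like, loosely following llama.cpp's pattern. The `LLM_MPI` flag, the `mpicc` override, and the flags here are assumptions on my part, not the actual build setup:

```makefile
# Rough sketch only: opt-in MPI build, loosely following llama.cpp's approach.
# LLM_MPI and the mpicc override are assumptions, not part of the current Makefile.
CC ?= gcc
CFLAGS = -O3 -Wno-unused-result

ifdef LLM_MPI
  CC = mpicc
  CFLAGS += -DLLM_MPI
endif

train_gpt2: train_gpt2.c
	$(CC) $(CFLAGS) $< -lm -o $@
```

Then something like `make train_gpt2 LLM_MPI=1` would compile with mpicc and enable the MPI code paths guarded by `#ifdef LLM_MPI`.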
definitely! but this is pretty far down the line, i think we first need to get the 1-GPU version to be super solid.
I regularly write MPI code, so this shouldn't be too complicated to implement. I've started to look through the CPU version to get started. However, I do have questions regarding the ML side.
There are a few options I can see:
- Data Parallelism, using MPI_Allreduce to average gradients. I think we would do this around here: https://github.com/karpathy/llm.c/blob/master/train_gpt2.c#L906C1-L906C5 (see the sketch below)
- Tensor parallelism (similar to llama.cpp)
- Model Parallelism
Is there a preference for how this should be scaled with MPI? If option 2 or 3 seems like the best choice, do you have a suggestion as to where in the code I should dig in?
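To make option 1 concrete, here is a minimal sketch of the gradient averaging, assuming it is called between `gpt2_backward()` and `gpt2_update()` around the line linked above. The helper name and the exact call site are my assumptions:

```c
// Minimal sketch of option 1 (data parallelism): every rank runs forward/backward
// on its own micro-batch, then gradients are averaged across ranks before the
// optimizer step. Assumes grads points at model.grads_memory and num_parameters
// matches model.num_parameters in train_gpt2.c.
#include <mpi.h>
#include <stddef.h>

void allreduce_gradients(float* grads, size_t num_parameters) {
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    // Sum the gradients from all ranks in place...
    MPI_Allreduce(MPI_IN_PLACE, grads, (int)num_parameters,
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    // ...then scale by 1/world_size to turn the sum into an average.
    for (size_t i = 0; i < num_parameters; i++) {
        grads[i] /= (float)world_size;
    }
}
```

Each rank would also need to read a different batch of tokens, otherwise all ranks compute identical gradients and the averaging buys nothing.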
Sounds great! I expect to get started with the backward pass somewhere over the weekend, most likely. (I spent today still optimizing the forward pass.) Once we have the backward pass, getting data parallel training in will be super awesome.
I would target MPI-2, since MPI-IO is all you need and it is the most widely supported.
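In case it helps clarify that suggestion, here is a hedged sketch of what sticking to MPI-2 features could look like for feeding each rank its own slice of the data via MPI-IO. The file layout (a flat array of int32 tokens) and the helper are illustrative assumptions, not how the current data loader works:

```c
// Sketch only: each rank reads a disjoint, contiguous block of tokens from the
// shared dataset file using MPI-IO (an MPI-2 feature), so no extra I/O library
// is required. The int32 token layout is an assumption for illustration.
#include <mpi.h>

void read_token_shard(const char* path, int* tokens, int tokens_per_rank) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    // Rank r reads tokens [r * tokens_per_rank, (r + 1) * tokens_per_rank).
    MPI_Offset offset = (MPI_Offset)rank * tokens_per_rank * sizeof(int);
    MPI_File_read_at_all(fh, offset, tokens, tokens_per_rank,
                         MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```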
The MPI version of this is mostly working at this point. I've tested it on up to 8 nodes, and it reduces training time by many hours.
@karpathy Are you still interested in an NCCL version? If so, are there any multi-GPU resources you could share?