Support MPI distributed training
I have this in mind for the Mojo target issue, which is really about having the Makefile support composability like the one for llama.cpp. We could probably copy-pasta most of what llama.cpp has so the build uses mpicc. We would still need to write the MPI code.
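For reference, a rough sketch of what that Makefile change might look like, loosely following llama.cpp's pattern. The `LLM_MPI` flag, the `mpicc` override, and the flags here are assumptions on my part, not the actual build setup:

```makefile
# Rough sketch only: opt-in MPI build, loosely following llama.cpp's approach.
# LLM_MPI and the mpicc override are assumptions, not part of the current Makefile.
CC ?= gcc
CFLAGS = -O3 -Wno-unused-result

ifdef LLM_MPI
  CC = mpicc
  CFLAGS += -DLLM_MPI
endif

train_gpt2: train_gpt2.c
	$(CC) $(CFLAGS) $< -lm -o $@
```

Then something like `make train_gpt2 LLM_MPI=1` would compile with mpicc and enable the MPI code paths guarded by `#ifdef LLM_MPI`.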
definitely! but this is pretty far down the line, i think we first need to get the 1-GPU version to be super solid.
I regularly write MPI code, so this shouldn't be too complicated to implement. I've started to look through the CPU version to get started. However, I do have questions regarding the ML side.
There are a few options I can see:
- Data Parallelism, using MPI_Allreduce to average gradients. I think we would do this around here: https://github.com/karpathy/llm.c/blob/master/train_gpt2.c#L906C1-L906C5 (see the sketch below)
- Tensor parallelism (similar to llama.cpp)
- Model Parallelism
Is there a preference for how this should be scaled with MPI? If option 2 or 3 seems like the best choice, do you have a suggestion as to where in the code I should dig in?
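To make option 1 concrete, here is a minimal sketch of the gradient averaging, assuming it is called between `gpt2_backward()` and `gpt2_update()` around the line linked above. The helper name and the exact call site are my assumptions:

```c
// Minimal sketch of option 1 (data parallelism): every rank runs forward/backward
// on its own micro-batch, then gradients are averaged across ranks before the
// optimizer step. Assumes grads points at model.grads_memory and num_parameters
// matches model.num_parameters in train_gpt2.c.
#include <mpi.h>
#include <stddef.h>

void allreduce_gradients(float* grads, size_t num_parameters) {
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    // Sum the gradients from all ranks in place...
    MPI_Allreduce(MPI_IN_PLACE, grads, (int)num_parameters,
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    // ...then scale by 1/world_size to turn the sum into an average.
    for (size_t i = 0; i < num_parameters; i++) {
        grads[i] /= (float)world_size;
    }
}
```

Each rank would also need to read a different batch of tokens, otherwise all ranks compute identical gradients and the averaging buys nothing.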
Sounds great! I expect to get started with the backward pass somewhere over the weekend, most likely. (I spent today still optimizing the forward pass.) Once we have the backward pass, getting data parallel training in will be super awesome.
I would target MPI-2, since MPI-IO is all you need and it is the most widely supported.
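In case it helps clarify that suggestion, here is a hedged sketch of what sticking to MPI-2 features could look like for feeding each rank its own slice of the data via MPI-IO. The file layout (a flat array of int32 tokens) and the helper are illustrative assumptions, not how the current data loader works:

```c
// Sketch only: each rank reads a disjoint, contiguous block of tokens from the
// shared dataset file using MPI-IO (an MPI-2 feature), so no extra I/O library
// is required. The int32 token layout is an assumption for illustration.
#include <mpi.h>

void read_token_shard(const char* path, int* tokens, int tokens_per_rank) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    // Rank r reads tokens [r * tokens_per_rank, (r + 1) * tokens_per_rank).
    MPI_Offset offset = (MPI_Offset)rank * tokens_per_rank * sizeof(int);
    MPI_File_read_at_all(fh, offset, tokens, tokens_per_rank,
                         MPI_INT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```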
The MPI version of this is mostly working at this point. I've tested it on up to 8 nodes, and it reduces training time by many hours.
@karpathy Are you still interested in an NCCL version? If so, are there any multi-GPU resources you could share?