
Enable distributed LoRA training

Open • angeloskath opened this issue 1 year ago • 2 comments

The updates to LORA.md are still missing, but the TL;DR is that we can now do

$ echo "m2-ultra-0 slots=1" >>hostfile
$ echo "m2-ultra-1 slots=1" >>hostfile
$ mpirun --hostfile hostfile -- python -m mlx_lm.lora --train --model mlx-community/Mistral-7B-v0.2-4bit --data /path/to/data --batch-size 16

to train across two nodes (or more; really, nothing else needs to change).
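As a side note, here is a minimal sketch (not from this thread; the file name check_distributed.py is hypothetical) of a script you might run with the same hostfile to confirm the MPI setup before starting the longer training run. It assumes mlx is installed on both machines and uses MLX's distributed API (mx.distributed.init / all_sum):

# check_distributed.py -- hedged sketch, not part of mlx-examples:
# confirm that mpirun can reach every host in the hostfile before
# launching the actual LoRA training run.
import mlx.core as mx

group = mx.distributed.init()   # initialize the MPI-backed distributed group
print(f"rank {group.rank()} of {group.size()} is up")

# a tiny all-reduce to confirm the nodes can actually communicate
total = mx.distributed.all_sum(mx.ones(1))
mx.eval(total)
assert total.item() == group.size()

Launched the same way as the training command, e.g.

$ mpirun --hostfile hostfile -- python check_distributed.py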

angeloskath · Jun 06 '24 15:06

Is it possible to do distributed inference as well?

mzbac · Jul 07 '24 06:07

Is it possible to do distributed inference as well?

Possible, yes, but getting a nice speedup is more challenging. That's something we're looking at, but we don't have an ETA on it right now.

awni · Jul 07 '24 13:07

@awni feel free to review and then we can merge. I split the launcher into a different branch.

angeloskath · Oct 31 '24 22:10

This works perfectly! Great job 👏

ivanfioravanti · Nov 17 '24 12:11