mlx-examples
Enable distributed LoRA training
The updates to LORA.md are still missing, but the TL;DR is that we can now do
$ echo "m2-ultra-0 slots=1" >>hostfile
$ echo "m2-ultra-1 slots=1" >>hostfile
$ mpirun --hostfile hostfile -- python -m mlx_lm.lora --train --model mlx-community/Mistral-7B-v0.2-4bit --data /path/to/data --batch-size 16
to train across two nodes (or more; nothing needs to change for additional nodes).
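For context, the core mechanism here is data-parallel training: each rank processes its own slice of the batch and the gradients are averaged across nodes before the optimizer step. Below is a minimal sketch of that pattern using `mx.distributed.all_sum`; it is not the actual `mlx_lm.lora` implementation, and `loss_fn`, `batch`, and `step` are illustrative placeholders.

```python
import mlx.core as mx
import mlx.nn as nn
from mlx.utils import tree_map

# Pick up the MPI environment set by mpirun and join the process group.
world = mx.distributed.init()


def all_average(grads):
    # Sum each gradient array across ranks, then divide by the world size.
    return tree_map(lambda g: mx.distributed.all_sum(g) / world.size(), grads)


def step(model, optimizer, loss_fn, batch):
    # loss_fn(model, batch) is a placeholder for whatever loss you train with.
    loss_and_grad = nn.value_and_grad(model, loss_fn)
    loss, grads = loss_and_grad(model, batch)  # local gradients on this rank's shard
    grads = all_average(grads)                 # synchronize gradients across nodes
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state)
    return loss
```

With this in place, launching the script under `mpirun` as shown above is all that is needed; each rank sees its own portion of the data and the all-reduce keeps the model replicas in sync.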
Is it possible to do distributed inference as well?
Possible, yes, but getting a nice speedup is more challenging. That's something we're looking at, but we don't have an ETA right now.
@awni feel free to review and then we can merge. I split the launcher into a different branch.
This works perfectly! Great job 👏