Add megablocks dropless MoE
This initial version focuses on integrating megablocks and getting it working with DeepSpeed parallelism. Megablocks experts operate within the existing parallelism scheme, which supports the full set of degrees of freedom: expert, expert-data, and tensor-expert-data parallelism.
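As a rough illustration of how these degrees of freedom compose, here is a minimal sketch (purely illustrative; `parallel_sizes` and the example splits are assumptions, not the actual gpt-neox/DeepSpeed configuration surface). With a fixed world size, choosing a tensor-parallel size and an expert-parallel size determines the remaining data-parallel factor:

```python
def parallel_sizes(world_size, tensor_parallel, expert_parallel):
    """Hypothetical helper: given TP and EP sizes, derive the remaining
    (expert-)data-parallel size. Not the actual gpt-neox API."""
    assert world_size % (tensor_parallel * expert_parallel) == 0, \
        "world size must be divisible by TP * EP"
    data_parallel = world_size // (tensor_parallel * expert_parallel)
    return tensor_parallel, expert_parallel, data_parallel

# Example splits on 8 GPUs (the 8xA100 test setup):
print(parallel_sizes(8, 1, 8))  # pure expert parallelism        -> (1, 8, 1)
print(parallel_sizes(8, 1, 4))  # expert-data parallelism        -> (1, 4, 2)
print(parallel_sizes(8, 2, 2))  # tensor-expert-data parallelism -> (2, 2, 2)
```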
Tested on 8xA100 for convergence and expert balancing; this testing also uncovered weight-initialization issues (to be fixed in a later PR).
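To give a sense of what the expert-balancing check looks at, here is a minimal sketch (not the actual verification code; `expert_load` and the toy assignments are made up for illustration). A balanced router spreads tokens roughly evenly across experts:

```python
import collections

def expert_load(expert_assignments, num_experts):
    """Toy balance metric: fraction of tokens routed to each expert."""
    counts = collections.Counter(expert_assignments)
    total = len(expert_assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]

# A perfectly balanced router sends 1/num_experts of tokens to each expert:
assignments = [0, 1, 2, 3, 0, 1, 2, 3]
print(expert_load(assignments, 4))  # -> [0.25, 0.25, 0.25, 0.25]
```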
Design document and worklog that accompanied this project: https://yaaang.notion.site/gpt-neox-MoE-design-doc-cc8586eb53144a5987b63f510ced021c
In terms of how this fits into the larger arc of work, the next PRs (I don't have permission to submit stacked PRs) will cover:
- improved expert initialization, as previously discussed
- integration tests that automate the verification shown earlier of convergence and expert + router gradients
- support for DeepSpeed pipeline parallelism
- merging with Colin's code and creating the megablocks code fork