Add megablocks dropless MoE
This initial version focuses on integrating megablocks and getting it working with DeepSpeed parallelism. Megablocks experts operate within the existing parallelism scheme, which supports the full set of degrees of freedom: expert, expert-data, and tensor-expert-data parallelism.
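As a rough illustration of how these degrees of freedom compose, here is a minimal sketch (purely illustrative; `parallel_sizes` and the example splits are assumptions, not the actual gpt-neox/DeepSpeed configuration surface). With a fixed world size, choosing a tensor-parallel size and an expert-parallel size determines the remaining data-parallel factor:

```python
def parallel_sizes(world_size, tensor_parallel, expert_parallel):
    """Hypothetical helper: given TP and EP sizes, derive the remaining
    (expert-)data-parallel size. Not the actual gpt-neox API."""
    assert world_size % (tensor_parallel * expert_parallel) == 0, \
        "world size must be divisible by TP * EP"
    data_parallel = world_size // (tensor_parallel * expert_parallel)
    return tensor_parallel, expert_parallel, data_parallel

# Example splits on 8 GPUs (the 8xA100 test setup):
print(parallel_sizes(8, 1, 8))  # pure expert parallelism        -> (1, 8, 1)
print(parallel_sizes(8, 1, 4))  # expert-data parallelism        -> (1, 4, 2)
print(parallel_sizes(8, 2, 2))  # tensor-expert-data parallelism -> (2, 2, 2)
```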
Tested on 8xA100 for convergence and expert balancing; this testing also uncovered weight-initialization issues (to be fixed in a later PR).
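To give a sense of what the expert-balancing check looks at, here is a minimal sketch (not the actual verification code; `expert_load` and the toy assignments are made up for illustration). A balanced router spreads tokens roughly evenly across experts:

```python
import collections

def expert_load(expert_assignments, num_experts):
    """Toy balance metric: fraction of tokens routed to each expert."""
    counts = collections.Counter(expert_assignments)
    total = len(expert_assignments)
    return [counts.get(e, 0) / total for e in range(num_experts)]

# A perfectly balanced router sends 1/num_experts of tokens to each expert:
assignments = [0, 1, 2, 3, 0, 1, 2, 3]
print(expert_load(assignments, 4))  # -> [0.25, 0.25, 0.25, 0.25]
```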
Design document and worklog that accompanied this project: https://yaaang.notion.site/gpt-neox-MoE-design-doc-cc8586eb53144a5987b63f510ced021c
In terms of how this fits into the larger arc of work, the next PRs (I don't have permission to submit stacked PRs) will cover:
- improved expert initialization, as previously discussed
- integration tests that automate the verification shown earlier of convergence and expert + router gradients
- support for DeepSpeed pipeline parallelism
- merging with Colin's code and creating the megablocks code fork