
Adding Expert Prototyping to FastMoE

Open JustinLin610 opened this issue 3 years ago • 1 comment

Hi, thanks for providing an end-to-end training framework in PyTorch for MoE models. We have recently implemented MoE in TensorFlow and found that categorizing experts into different groups can bring improvements in model quality. More details can be found in our paper: https://arxiv.org/abs/2105.15082. I wonder if it is possible to add this feature, as FastMoE really facilitates research in sparse expert models.

Generally, this strategy categorizes experts into different groups, each of which has its own gating function for routing. It is compatible with conventional routing methods such as Switch (top-1) or top-2 routing, since you can simply set the group number to 1. We find that increasing the value of k in top-k routing can improve model performance, and that k top-1 (k groups, each doing top-1 routing) achieves a similar effect. It is also possible to try out more complex strategies, say k top-k' or so.

We have a code snippet in the appendix, which may be helpful.
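In case it helps to make the idea concrete in PyTorch, here is a minimal sketch of the k top-1 grouping described above. It splits the experts into groups, gives each group its own gate, and routes every token to the top-1 expert within each group. All names here (GroupedTop1MoE, d_hidden, etc.) are illustrative assumptions, not FastMoE's or the paper's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedTop1MoE(nn.Module):
    """Sketch of k top-1 routing: one gate per expert group, top-1 within each group."""

    def __init__(self, d_model, d_hidden, num_experts, num_groups):
        super().__init__()
        assert num_experts % num_groups == 0
        self.num_groups = num_groups
        self.experts_per_group = num_experts // num_groups
        # One gating function per group.
        self.gates = nn.ModuleList(
            [nn.Linear(d_model, self.experts_per_group) for _ in range(num_groups)]
        )
        # Simple FFN experts for illustration.
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model)
                )
                for _ in range(num_experts)
            ]
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        out = torch.zeros_like(x)
        for g, gate in enumerate(self.gates):
            scores = F.softmax(gate(x), dim=-1)      # (num_tokens, experts_per_group)
            weight, local_idx = scores.max(dim=-1)   # top-1 expert within this group
            expert_idx = g * self.experts_per_group + local_idx
            for e in expert_idx.unique().tolist():
                mask = expert_idx == e
                out[mask] += weight[mask].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```

Setting num_groups to 1 recovers plain top-1 (Switch-style) routing, and each gate could pick its top-k' experts instead of top-1 for the k top-k' variant.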

JustinLin610 · Aug 23 '21 02:08

Here is another recent work on MoE.

DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning https://arxiv.org/abs/2106.03760

The idea is to activate all experts at the beginning of training, but quickly converge to sparse activation. I wonder whether such a mechanism can help train better pre-trained models when our expert pool is not that large.
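To illustrate the dense-to-sparse behavior only, here is a rough PyTorch sketch that uses temperature-annealed softmax gating; this is a stand-in for the idea, not DSelect-k's actual differentiable binary-encoding construction. All names and hyperparameters (AnnealedDenseGate, start_temp, decay) are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F


class AnnealedDenseGate(nn.Module):
    """Dense mixture weights early in training that sharpen toward sparse routing."""

    def __init__(self, d_model, num_experts, start_temp=5.0, min_temp=0.1, decay=0.999):
        super().__init__()
        self.proj = nn.Linear(d_model, num_experts)
        self.temp = start_temp      # high temperature: all experts get weight
        self.min_temp = min_temp
        self.decay = decay

    def forward(self, x):  # x: (num_tokens, d_model)
        # Differentiable mixture weights over all experts.
        weights = F.softmax(self.proj(x) / self.temp, dim=-1)
        if self.training:
            # Decay the temperature each call so the gate converges
            # toward near-one-hot (sparse) routing as training proceeds.
            self.temp = max(self.min_temp, self.temp * self.decay)
        return weights
```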

Let me know what you think about it.

xptree · Sep 06 '21 02:09