
[REQUEST] Expert Choice Routing for MoE

Open clumsy opened this issue 3 years ago • 5 comments

Is your feature request related to a problem? Please describe. A paper was published on a potentially better token-to-expert routing scheme for MoE that leaves fewer experts under-trained.

Describe the solution you'd like In addition to GShard's top-2 and Switch Transformer's top-1 per-token expert routing, add an expert choice routing option.
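For readers unfamiliar with the scheme, here is a minimal sketch of the idea, not DeepSpeed code and not the paper's reference implementation. In top-1/top-2 routing each token picks its experts; in expert choice routing each expert picks its top-k tokens, with k = n * c / e for n tokens, e experts, and capacity factor c. The function name `expert_choice_route` and its list-based interface are hypothetical, chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's router logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [v / total for v in exps]


def expert_choice_route(logits, capacity_factor=1.0):
    """Hypothetical helper: route tokens with expert choice.

    logits: n rows (one per token), each holding e router logits.
    Returns one list per expert of (token_index, gate_weight) pairs.
    """
    n = len(logits)
    e = len(logits[0])
    # Every expert gets the same fixed bucket size, so load is balanced
    # by construction.
    k = max(1, int(n * capacity_factor / e))
    # Token-to-expert affinities, normalized over experts for each token.
    probs = [softmax(row) for row in logits]
    assignments = []
    for j in range(e):
        # Expert j ranks all tokens by affinity and keeps the k best.
        ranked = sorted(range(n), key=lambda i: probs[i][j], reverse=True)
        assignments.append([(i, probs[i][j]) for i in ranked[:k]])
    return assignments


logits = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
routes = expert_choice_route(logits, capacity_factor=1.0)
chosen = [[i for i, _ in per_expert] for per_expert in routes]
```

With 4 tokens, 2 experts, and c = 1, each expert takes exactly 2 tokens. Note that a token can end up selected by several experts or by none, which is the main behavioral difference from per-token top-k gating.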

Describe alternatives you've considered N/A

Additional context N/A

clumsy avatar Nov 17 '22 18:11 clumsy

The authors claim 2x faster convergence with expert choice (EC) routing: https://ai.googleblog.com/2022/11/mixture-of-experts-with-expert-choice.html

I hope this incentivizes implementing it in DeepSpeed.

clumsy avatar Feb 13 '23 16:02 clumsy

Thank you @clumsy for sharing this paper.

@ykim362, have you seen this paper? Is anyone in your team or any interns interested in implementing this feature?

awan-10 avatar Feb 14 '23 18:02 awan-10

In case this helps, TL;DR is in Lilian Weng's blog post.

clumsy avatar Feb 17 '23 00:02 clumsy

Hi @awan-10. I have an implementation of this paper, but we didn't see the gains mentioned in the paper. In fact, the accuracy was considerably worse than with the original top-1 and top-2 gating.

@clumsy have you actually done any experiments with this expert choice gating?

ykim362 avatar Feb 17 '23 21:02 ykim362

No @ykim362, but I would like to experiment with it and share the results. Is it possible to share the snippet with the implementation you used?

clumsy avatar Feb 22 '23 16:02 clumsy

@clumsy you can take a look at this experimental branch. https://github.com/ykim362/DeepSpeed/tree/youki/expc

ykim362 avatar Jul 10 '23 23:07 ykim362

Hey, Google has an implementation of expert choice routing here: https://github.com/google/flaxformer/blob/main/flaxformer/architectures/moe/routing.py#L647-L717

They have a note that it should not be used in decoder blocks. Maybe that was the reason for the poor results in your experiments?
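To illustrate why that restriction would matter (as I understand it, and without any confirmation in this thread that it explains the results above): an expert selects its top-k tokens across the whole sequence, so whether an earlier token gets routed can depend on the scores of later tokens. In an autoregressive decoder that leaks future information. The toy sketch below uses made-up affinity scores and a hypothetical helper `top_k_tokens`.

```python
def top_k_tokens(scores, k):
    """Indices of the k highest-scoring tokens for a single expert."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


# One expert's affinity for each of 4 tokens in a sequence (made-up values),
# with capacity k = 2.
seq_a = [0.9, 0.5, 0.1, 0.2]
seq_b = [0.9, 0.5, 0.1, 0.8]  # identical except for the last (future) token

# Token 1 has the same score in both sequences, yet its routing differs
# because the later token 3 displaces it from the expert's bucket in seq_b.
token1_routed_a = 1 in top_k_tokens(seq_a, 2)
token1_routed_b = 1 in top_k_tokens(seq_b, 2)
```

Here `token1_routed_a` is true and `token1_routed_b` is false, even though nothing at or before position 1 changed. Per-token top-1/top-2 gating does not have this property, since each token's routing depends only on its own logits.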

ilyalasy avatar Mar 22 '24 13:03 ilyalasy