megablocks 1-expert worse than dense model

Open Muennighoff opened this issue 9 months ago • 1 comment

I'm finding that training a 1-expert dMoE (brown) gives a worse training loss than an otherwise equivalent dense model (green). Is there a reason to expect this difference, or should the two match? Thanks!
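For reference, here is a minimal sketch of why one might expect the two to match in the forward pass. It is a toy in plain Python, not the MegaBlocks dMoE implementation: `dense_mlp`, `moe_layer`, the tanh-approximated GELU, and the softmax router are all assumptions made for illustration. With a single expert, the softmax over one router logit is exactly 1.0, so the routed output reduces to the dense MLP output.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    # w is a list of rows; returns the matrix-vector product w @ x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def gelu(v):
    # tanh approximation of GELU (an assumed activation for this toy).
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def dense_mlp(x, w1, w2):
    # Plain two-layer MLP: w2 @ gelu(w1 @ x).
    return matvec(w2, [gelu(h) for h in matvec(w1, x)])


def moe_layer(x, router_w, experts, top_k=1):
    # Hypothetical token-level MoE: softmax router, top-k selection,
    # expert outputs weighted by their router probabilities.
    probs = softmax(matvec(router_w, x))
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    out = [0.0] * len(experts[0][1])
    for i in top:
        y = dense_mlp(x, *experts[i])
        out = [o + probs[i] * yi for o, yi in zip(out, y)]
    return out


random.seed(0)
d_model, d_hidden = 4, 8
w1 = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w2 = [[random.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_model)]
router_w = [[random.gauss(0, 1) for _ in range(d_model)]]  # one row = one expert
x = [random.gauss(0, 1) for _ in range(d_model)]

dense_out = dense_mlp(x, w1, w2)
moe_out = moe_layer(x, router_w, [(w1, w2)])  # the single expert shares the dense weights
max_diff = max(abs(a - b) for a, b in zip(dense_out, moe_out))
print(max_diff)
```

Under these assumptions the forward passes agree, so a gap in training loss would have to come from something this idealized picture leaves out.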

May 08 '24 17:05 Muennighoff
