
1-expert worse than dense model

Open · Muennighoff opened this issue 9 months ago · 1 comment

I'm finding that a 1-expert dMoE (brown) reaches a worse training loss than an otherwise equivalent dense model (green). Is there a reason to expect this difference, or should the two match? Thanks!

[Screenshot, 2024-05-08: training loss curves for the 1-expert dMoE (brown) and the dense model (green)]

Muennighoff · May 08 '24 17:05
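For reference while debugging the question above, here is a minimal toy sketch of why one might expect the two to match in the forward pass. It assumes a plain softmax router and a ReLU MLP expert; every name in it (`softmax`, `dense_mlp`, `one_expert_moe`, `router_w`) is hypothetical and none of it is MegaBlocks code. With a single expert, the softmax over one logit is always 1.0, so the gated expert output equals the dense MLP output when the weights are shared.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def dense_mlp(x, w1, w2):
    # Toy two-layer MLP with ReLU (assumed activation, no biases).
    hidden = [max(0.0, h) for h in matvec(w1, x)]
    return matvec(w2, hidden)


def one_expert_moe(x, router_w, w1, w2):
    # router_w has one row, so the router emits a single logit.
    gates = softmax(matvec(router_w, x))  # always [1.0] with one expert
    expert_out = dense_mlp(x, w1, w2)     # the only expert reuses the dense weights
    return [gates[0] * v for v in expert_out]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    w1 = rand_matrix(8, 4, rng)
    w2 = rand_matrix(4, 8, rng)
    router_w = rand_matrix(1, 4, rng)

    dense = dense_mlp(x, w1, w2)
    moe = one_expert_moe(x, router_w, w1, w2)
    # Largest elementwise gap between the two outputs.
    print(max(abs(a - b) for a, b in zip(dense, moe)))
```

If this picture carries over to the real setup, a gap in training loss would have to come from something outside the gating math, possibly initialization, an auxiliary router loss, or differences in layer configuration. That is a guess, and the sketch does not test any of those.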