megablocks
1-expert dMoE has worse training loss than dense model
I'm finding that a 1-expert dMoE (brown) reaches a worse training loss than an otherwise equivalent dense model (green). Is there a reason this difference is expected, or should the two match? Thanks!
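To make the expectation concrete, here is a minimal sketch of why I'd assume the two should match. It uses a generic softmax top-k router in plain Python, not MegaBlocks' actual dMoE code; `dense_mlp`, `moe_layer`, and the scalar weights are hypothetical stand-ins for illustration only.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense_mlp(x, w):
    # Hypothetical stand-in for the dense FFN: one scalar weight plus ReLU.
    return max(0.0, w * x)

def moe_layer(x, router_logits, expert_weights, top_k=1):
    # Generic top-k routing (an assumption, not MegaBlocks' implementation):
    # weight each selected expert's output by its router probability.
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    return sum(probs[i] * dense_mlp(x, expert_weights[i]) for i in top)

x, w = 0.7, 1.3
router_logit = -4.2  # arbitrary: with one expert the value cannot matter

dense_out = dense_mlp(x, w)
moe_out = moe_layer(x, [router_logit], [w])

# With a single expert the softmax weight is exactly 1.0,
# so the MoE output equals the dense output.
print(softmax([router_logit]), dense_out == moe_out)
```

Under these assumptions the router contributes nothing with a single expert, so in this idealized picture the layer reduces to the dense MLP. That is why I'd expect the gap to come from something outside it.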