
different load_balancing_loss with different pipeline_parallel_size

Open bozheng-hit opened this issue 1 year ago • 8 comments

I load the same model trained with Megatron + MegaBlocks and found that the load_balancing_loss is slightly different: when I increase the pipeline_parallel_size, the load_balancing_loss also increases. Is this just a precision issue, or is there a potential bug?

For example, when I train a 500M GPT model with 64 experts, I get the load_balancing_loss (lbl) values listed below for each pipeline_parallel_size (pp_size).

| pp_size | lbl |
| --- | --- |
| 1 | 1.005E-01 |
| 2 | 1.007E-01 |
| 4 | 1.013E-01 |
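
For context, here is a minimal sketch (not MegaBlocks' actual code; all names and the aggregation scheme are assumptions) of how per-layer auxiliary losses of the standard Switch-Transformer style might be combined under pipeline parallelism: each stage averages over the layers it owns, and the stage values are then averaged across ranks. Comparing the value each configuration logs against an fp64 average over all layers is one way to separate a pure precision effect from a normalization/aggregation mismatch.

```python
import torch

# Hypothetical per-layer routing statistics for a 24-layer model with 64 experts.
# f[l, e] = fraction of tokens routed to expert e in layer l
# p[l, e] = mean router probability assigned to expert e in layer l
torch.manual_seed(0)
num_layers, num_experts = 24, 64
f = torch.softmax(torch.randn(num_layers, num_experts), dim=-1)
p = torch.softmax(torch.randn(num_layers, num_experts), dim=-1)

# Switch-Transformer-style auxiliary loss per layer: num_experts * sum_e f_e * p_e
per_layer_lbl = num_experts * (f * p).sum(dim=-1)

def logged_lbl(pp_size: int) -> float:
    """Hypothetical aggregation: each pipeline stage averages over its own
    layers, then the stage values are averaged across ranks."""
    stages = per_layer_lbl.split(num_layers // pp_size)
    return torch.stack([s.mean() for s in stages]).mean().item()

reference = per_layer_lbl.double().mean().item()  # fp64 global average over all layers
for pp in (1, 2, 4):
    print(f"pp_size={pp}: lbl={logged_lbl(pp):.6e}  (fp64 reference={reference:.6e})")
```

If the logged per-configuration values differ from such an fp64 reference by roughly the spread shown in the table above, that would point toward how the per-stage losses are normalized or reduced rather than toward floating-point precision alone; this is only a diagnostic suggestion, not a claim about where the discrepancy actually comes from.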

bozheng-hit · Jan 05 '24 06:01