different load_balancing_loss with different pipeline_parallel_size
I loaded the same model trained with megatron + megablocks, and I found that the load_balancing_loss is slightly different. As I increase pipeline_parallel_size, the load_balancing_loss also increases. Is this just a precision issue, or is there a potential bug?
For example, when I train a 500M GPT model with 64 experts, I get the load_balancing_loss (lbl) values listed below for each pipeline_parallel_size (pp_size).
| pp_size | lbl |
|---|---|
| 1 | 1.005E-01 |
| 2 | 1.007E-01 |
| 4 | 1.013E-01 |
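
For context, here is a minimal, self-contained sketch of the Switch-Transformer-style auxiliary load balancing loss and a per-stage reduction of the kind pipeline parallelism introduces. This is illustrative only, not megablocks' actual implementation; the names `layer_load_balancing_loss`, `pipeline_lbl`, and `router_logits_per_layer` are hypothetical. It shows one place where changing pp_size can change the result: splitting the same set of layers into more stages changes the floating point reduction order.

```python
import torch
import torch.nn.functional as F

def layer_load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Switch-style aux loss for one MoE layer. router_logits: [num_tokens, num_experts]."""
    probs = F.softmax(router_logits, dim=-1)              # soft assignment per token
    top1 = probs.argmax(dim=-1)                           # hard top-1 routing decision
    # f_i: fraction of tokens dispatched to expert i
    f = F.one_hot(top1, num_experts).float().mean(dim=0)
    # p_i: mean router probability assigned to expert i
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)

def pipeline_lbl(router_logits_per_layer, num_experts: int, pp_size: int) -> torch.Tensor:
    """Average the per-layer losses, reducing within each pipeline stage first.

    Assumes pp_size evenly divides the number of layers. Each stage sums the
    losses of its local layers before the per-stage partial sums are combined,
    so a different pp_size produces a different summation order and can shift
    the reported loss in the low-order digits.
    """
    layers = list(router_logits_per_layer)
    per_stage = len(layers) // pp_size
    stage_sums = []
    for s in range(pp_size):
        chunk = layers[s * per_stage:(s + 1) * per_stage]
        stage_sums.append(sum(layer_load_balancing_loss(x, num_experts) for x in chunk))
    return torch.stack(stage_sums).sum() / len(layers)
```

Reduction-order effects alone are usually tiny, though, so differences in the third significant digit like the ones in the table above may point at something else, e.g. how the per-layer losses are scaled or averaged per stage.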