Question regarding the load importance loss calculation

Open wangyirui opened this issue 1 year ago • 1 comments

Hi, when studying the load importance loss, I found the parameters passed to the function load_importance_loss are softmax normalized scores and logits with noise (see moe_layer.py L281). I am wondering why we use the softmax normalized score to calculate the diff against the raw logits with noise? Why not consistently use the softmax output for both? Thanks!

Jun 15 '24 14:06 wangyirui

Hi, the standard GShard MoE follows the branch self.is_gshard_loss == True, while the loss option you pointed out is designed and preferred by Swin-Transformer MoE.

According to load_importance_loss defined in https://github.com/microsoft/tutel/blob/main/tutel/impls/losses.py#L29, it requires normalization to perform directly on the score tensor without doing noise which avoids normalization results to be polluted by the noise. @zeliu98

Jun 15 '24 15:06 ghostplant