Question regarding the load importance loss calculation
Hi, while studying the load importance loss, I noticed that the arguments passed to load_importance_loss are the softmax-normalized scores and the noisy logits (see moe_layer.py L281). Why is the diff computed between the softmax-normalized scores and the raw noisy logits? Why not use the softmax output consistently for both? Thanks!
Hi, the standard GShard MoE takes the self.is_gshard_loss == True branch, while the loss option you pointed out is the one designed for, and preferred by, Swin-Transformer MoE.
According to load_importance_loss, defined in https://github.com/microsoft/tutel/blob/main/tutel/impls/losses.py#L29, the normalization has to be applied directly to the score tensor before any noise is added, so that the normalized results are not polluted by the noise. @zeliu98
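For anyone following along, here is a minimal, dependency-free sketch of a loss with this shape. It is not Tutel's code: the function and parameter names (load_importance_loss_sketch, clean_logits, noisy_logits, top_k, noise_std) are made up for illustration, and the Gaussian CDF and the squared coefficient of variation are assumptions borrowed from the usual noisy top-k gating formulation. The point it tries to show is that the softmax is taken on the noise-free logits, while the noisy logits only supply the per-token top-k threshold.

```python
import math
from statistics import mean, variance


def softmax(row):
    """Numerically stable softmax over one token's expert logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cv_squared(values, eps=1e-10):
    """Squared coefficient of variation: sample variance / mean^2."""
    return variance(values) / (mean(values) ** 2 + eps)


def normal_cdf(x, std):
    """CDF of a zero-mean Gaussian with standard deviation `std`."""
    return 0.5 * (1.0 + math.erf(x / (std * math.sqrt(2.0))))


def load_importance_loss_sketch(clean_logits, noisy_logits, top_k, noise_std):
    """Hypothetical sketch; inputs are [num_tokens][num_experts] lists."""
    # Normalize the noise-free logits, so the scores carry no noise.
    scores = [softmax(row) for row in clean_logits]
    num_experts = len(scores[0])

    # Importance: total normalized score each expert receives.
    importance = [sum(row[e] for row in scores) for e in range(num_experts)]

    # Threshold: the k-th largest noisy logit of each token.
    thresholds = [sorted(row, reverse=True)[top_k - 1] for row in noisy_logits]

    # Load: smooth estimate of how often each expert clears the threshold,
    # computed from the diff between the clean score and the noisy threshold.
    load = [
        sum(normal_cdf(row[e] - t, noise_std) for row, t in zip(scores, thresholds))
        for e in range(num_experts)
    ]

    return 0.5 * (cv_squared(importance) + cv_squared(load))


if __name__ == "__main__":
    balanced = [[0.0, 0.0], [0.0, 0.0]]
    skewed = [[5.0, 0.0], [5.0, 0.0]]
    skewed_noisy = [[5.1, -0.2], [4.9, 0.3]]
    print(load_importance_loss_sketch(balanced, balanced, top_k=1, noise_std=1.0))
    print(load_importance_loss_sketch(skewed, skewed_noisy, top_k=1, noise_std=1.0))
```

In this sketch, a perfectly balanced router gives a loss of zero, and a router that sends every token to the same expert gives a strictly larger value.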