st-moe-pytorch
Seeking Help on Loss Behavior
First of all, thank you for your project, it looks great! I have been trying to apply it to ViT, much like V-MoE. During training I observed some changes in the losses, as shown in the graph below. I have a few questions and would like your guidance on whether these situations are normal:
- The `balance_loss` briefly increases and then stabilizes around 5.0 without decreasing. How can I verify whether the experts have achieved balance in this case?
- The `aux_loss`, which is the sum of `weighted_balance_loss` and `weighted_router_z_loss`, seems to make a relatively small contribution to the overall `loss`. Although it is indeed decreasing, should I increase the values of the two `coef`s in your code?
- Is there a recommended `batch_size` for training MoE? I have noticed that different `batch_size` values yield different results, and the `batch_size` mentioned in the ST-MoE paper is too large for individual users like me to refer to.
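For the first question, here is a rough sketch of how I have been checking balance, independent of the loss value: count how many tokens each expert receives and compare against the uniform fraction `1 / num_experts`. The helper name and the flat list of top-1 expert indices are my own illustration, not part of the st-moe-pytorch API:

```python
from collections import Counter

def expert_load_fractions(expert_indices, num_experts):
    """Return the fraction of tokens routed to each expert.

    `expert_indices` is a flat list of the expert chosen for each
    token in a batch (top-1 routing); balanced routing should give
    every expert a fraction close to 1 / num_experts.
    """
    counts = Counter(expert_indices)
    total = len(expert_indices)
    return [counts.get(e, 0) / total for e in range(num_experts)]

# Perfectly balanced routing over 4 experts gives 0.25 per expert.
fractions = expert_load_fractions([0, 1, 2, 3] * 8, num_experts=4)
print(fractions)  # [0.25, 0.25, 0.25, 0.25]
```

Would logging something like this per step be a sensible way to confirm balance, rather than relying on where `balance_loss` plateaus?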