st-moe-pytorch
Seeking Help on Loss Behavior
First of all, thank you for your project, it looks great! I have been trying to apply it to ViT, much like V-MoE. During training I observed some changes in the losses, as shown in the graph below. I have a few questions and would like your guidance on whether these situations are normal:
- The `balance_loss` briefly increases and then stabilizes around 5.0 without decreasing. How can I verify whether the experts have achieved balance in this case?
- The `aux_loss`, which is the sum of `weighted_balance_loss` and `weighted_router_z_loss`, seems to make a relatively small contribution to the overall `loss`. Although it is indeed decreasing, should I increase the values of the two `coef`s in your code?
- Is there a recommended `batch_size` for training MoE? I have noticed that different `batch_size` values yield different results, and the `batch_size` mentioned in the ST-MoE paper is too large for individual users like me to refer to.
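For the first question, here is a rough sketch of how I have been checking balance, independent of the loss value: count how many tokens each expert receives and compare against the uniform fraction `1 / num_experts`. The helper name and the flat list of top-1 expert indices are my own illustration, not part of the st-moe-pytorch API:

```python
from collections import Counter

def expert_load_fractions(expert_indices, num_experts):
    """Return the fraction of tokens routed to each expert.

    `expert_indices` is a flat list of the expert chosen for each
    token in a batch (top-1 routing); balanced routing should give
    every expert a fraction close to 1 / num_experts.
    """
    counts = Counter(expert_indices)
    total = len(expert_indices)
    return [counts.get(e, 0) / total for e in range(num_experts)]

# Perfectly balanced routing over 4 experts gives 0.25 per expert.
fractions = expert_load_fractions([0, 1, 2, 3] * 8, num_experts=4)
print(fractions)  # [0.25, 0.25, 0.25, 0.25]
```

Would logging something like this per step be a sensible way to confirm balance, rather than relying on where `balance_loss` plateaus?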