st-moe-pytorch
Seeking Help on Loss Behavior
First of all, thank you for your project, it looks great! I have been trying to apply it to a ViT, as in V-MoE. During training I observed the loss behavior shown in the graph below. I have a few questions and would like your guidance on whether these situations are normal:
- For the `balance_loss`, it briefly increases and then stabilizes around 5.0 without decreasing. How can I verify whether the experts have achieved balance in this case?
- The `aux_loss`, which is the sum of `weighted_balance_loss` and `weighted_router_z_loss`, seems to have a relatively small contribution to the overall `loss`. Although it is indeed decreasing, should I increase the values of the two `coef` in your code?
- Is there a recommended `batch_size` for training MoE? I have noticed that different `batch_size` values yield different results. The `batch_size` mentioned in the ST-MoE paper is too large for individual users like me to refer to.
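On the first question, one direct way to check balance (independent of the loss value) is to log the fraction of tokens routed to each expert and compare it to the uniform fraction `1/num_experts`. This is a minimal, dependency-free sketch; `gate_indices` is a hypothetical flat list of top-1 expert assignments that you would collect from the router, not an output of this repo's API:

```python
from collections import Counter

def expert_balance_stats(gate_indices, num_experts):
    """Return per-expert routing fractions and their coefficient of
    variation (CV). CV == 0 means perfectly uniform routing; values
    near or above 1 indicate some experts are starved."""
    counts = Counter(gate_indices)
    total = len(gate_indices)
    fractions = [counts.get(e, 0) / total for e in range(num_experts)]
    mean = 1.0 / num_experts
    variance = sum((f - mean) ** 2 for f in fractions) / num_experts
    cv = variance ** 0.5 / mean
    return fractions, cv

# Perfectly balanced routing over 4 experts -> CV of 0
fractions, cv = expert_balance_stats([0, 1, 2, 3, 0, 1, 2, 3], num_experts=4)
# All traffic collapsed onto one of 2 experts -> CV of 1
_, cv_collapsed = expert_balance_stats([0, 0, 0, 0], num_experts=2)
```

Tracking this histogram over training steps tells you whether a flat `balance_loss` reflects genuinely even routing or a stable imbalance.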
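On the second question, before raising the coefficients it may help to quantify how small "relatively small" actually is. A rough sketch (the default coefficients below are the ST-MoE paper's suggested 1e-2 for the balance loss and 1e-3 for the router z-loss; check them against the actual defaults in this repo before relying on them):

```python
def aux_contribution(task_loss, balance_loss, router_z_loss,
                     balance_coef=1e-2, router_z_coef=1e-3):
    """Fraction of the total training loss contributed by the weighted
    auxiliary terms. A tiny fraction suggests the aux losses exert
    little pressure on the router; a large one can hurt the task loss."""
    aux = balance_coef * balance_loss + router_z_coef * router_z_loss
    total = task_loss + aux
    return aux / total

# e.g. task_loss=2.3, balance_loss=5.0, router_z_loss=0.5
# -> aux = 0.01*5.0 + 0.001*0.5 = 0.0505, roughly 2% of the total
share = aux_contribution(2.3, 5.0, 0.5)
```

If the fraction stays in the low single-digit percent range, that is by design: the auxiliary terms are meant to nudge routing, not dominate the objective, so a small contribution alone is not a reason to increase the coefficients.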