
Some questions about shared_expert_gate

Open cooper12121 opened this issue 10 months ago • 1 comment

Compared to DeepSeek-MoE, your model adds an additional learnable parameter, self.shared_expert_gate = torch.nn.Linear(config.hidden_size, 1, bias=False), to control the proportion of the shared experts' output. There are some things I would like to know:

  • how the shared_expert_gate value changes during pretraining;
  • the range of the shared_expert_gate value during inference;
  • whether the gate value varies greatly across different tasks;
  • whether you have tested the effect of fixing shared_expert_gate to specific values on model performance;
  • whether you have tested the effect of dropping shared_expert_gate on model performance.
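For readers following along, the mechanism being asked about can be sketched roughly as follows. This is a minimal, framework-free illustration, not the actual Qwen MoE implementation: `shared_expert_gate_combine`, its argument names, and the toy vectors are all made up for this sketch. The real module projects each token's hidden state to a single logit with the `shared_expert_gate` linear layer and passes it through a sigmoid, so the shared expert's contribution is scaled by a per-token factor in (0, 1) before being added to the routed experts' output.

```python
import math

def shared_expert_gate_combine(hidden, gate_weight, shared_out, routed_out):
    """Illustrative sketch of a scalar shared-expert gate.

    `gate_weight` plays the role of the shared_expert_gate weight
    (a hidden_size -> 1 projection with no bias); the sigmoid keeps
    the blending factor for the shared expert in (0, 1).
    """
    # Scalar logit: dot product of the token's hidden state with the gate weight.
    logit = sum(h * w for h, w in zip(hidden, gate_weight))
    gate = 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> value in (0, 1)
    # Gated shared-expert output added to the routed experts' output.
    return [gate * s + r for s, r in zip(shared_out, routed_out)]

# Toy usage: a zero gate weight gives logit 0, so the gate value is 0.5
# and exactly half of the shared expert's output is blended in.
out = shared_expert_gate_combine(
    hidden=[1.0, 2.0],
    gate_weight=[0.0, 0.0],
    shared_out=[2.0, 4.0],
    routed_out=[1.0, 1.0],
)
```

Note that unlike the softmax used for routing among the sparse experts, this gate is an independent sigmoid per token, so dropping it (or pinning it to a constant) changes how much the always-on shared expert contributes, which is exactly what the questions above probe.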

I'm curious about the impact of shared_expert_gate and am looking forward to your reply!

cooper12121 avatar Apr 07 '24 03:04 cooper12121

Sorry, we are not going to release these details for now. Stay tuned for the upcoming tech report.

JustinLin610 avatar Apr 07 '24 07:04 JustinLin610