Qwen2.5
Some questions about shared_expert_gate
Compared to DeepSeek-MoE, your model adds an additional learnable parameter, `self.shared_expert_gate = torch.nn.Linear(config.hidden_size, 1, bias=False)`, to control the proportion of the shared experts' output in the result. There are some things I want to know:
- how the shared_expert_gate value changes during pretraining;
- the range of the shared_expert_gate value during inference;
- whether the gate value varies greatly across different tasks;
- whether you have tested the effect of fixing shared_expert_gate to specific values on model performance;
- whether you have tested the effect of dropping shared_expert_gate on model performance.

I'm curious about the impact of shared_expert_gate and am looking forward to your reply!
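For context, here is a minimal NumPy sketch of how such a per-token scalar gate is typically applied in this kind of MoE block: the `(hidden_size, 1)` projection produces one logit per token, a sigmoid squashes it into (0, 1), and that scalar scales the shared expert's output before it is added to the routed experts' output. The tensors below are random stand-ins for the actual expert MLPs, and the exact combination order is an assumption based on the public Qwen2-MoE modeling code, not an official statement.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, tokens = 8, 4

# the extra learnable parameter: a (hidden_size -> 1) projection with no bias,
# analogous to torch.nn.Linear(config.hidden_size, 1, bias=False)
W_gate = rng.standard_normal((hidden_size, 1))

hidden_states = rng.standard_normal((tokens, hidden_size))
shared_out = rng.standard_normal((tokens, hidden_size))  # stand-in for the shared expert MLP
routed_out = rng.standard_normal((tokens, hidden_size))  # stand-in for the routed experts' sum

# per-token scalar gate in (0, 1) controlling the shared expert's contribution
gate = 1.0 / (1.0 + np.exp(-(hidden_states @ W_gate)))  # shape (tokens, 1)
output = routed_out + gate * shared_out

assert gate.shape == (tokens, 1)
assert np.all((gate > 0.0) & (gate < 1.0))
```

This framing is why the questions above are natural: a sigmoid gate near 1 everywhere would mean the shared expert is always fully mixed in, while task-dependent gate values would indicate the model learned when to rely on it.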
Sorry, we are not releasing these details for now. Stay tuned for the upcoming tech report.