Qwen2.5
Some questions about shared_expert_gate
Compared to DeepSeek-MoE, your model adds an additional learnable parameter, `self.shared_expert_gate = torch.nn.Linear(config.hidden_size, 1, bias=False)`, to control the proportion of the shared experts' output in the result. There are some things I want to know:
- how the shared_expert_gate value changes during pretraining;
- the range of the shared_expert_gate value during inference;
- whether the gate value varies greatly across different tasks;
- whether you have tested the effect of fixing shared_expert_gate to specific values on model performance;
- whether you have tested the effect of dropping shared_expert_gate on model performance.

I'm curious about the impact of shared_expert_gate and am looking forward to your reply!
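For context, here is a minimal NumPy sketch of how such a per-token scalar gate is typically applied in this kind of MoE block: the `(hidden_size, 1)` projection produces one logit per token, a sigmoid squashes it into (0, 1), and that scalar scales the shared expert's output before it is added to the routed experts' output. The tensors below are random stand-ins for the actual expert MLPs, and the exact combination order is an assumption based on the public Qwen2-MoE modeling code, not an official statement.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, tokens = 8, 4

# the extra learnable parameter: a (hidden_size -> 1) projection with no bias,
# analogous to torch.nn.Linear(config.hidden_size, 1, bias=False)
W_gate = rng.standard_normal((hidden_size, 1))

hidden_states = rng.standard_normal((tokens, hidden_size))
shared_out = rng.standard_normal((tokens, hidden_size))  # stand-in for the shared expert MLP
routed_out = rng.standard_normal((tokens, hidden_size))  # stand-in for the routed experts' sum

# per-token scalar gate in (0, 1) controlling the shared expert's contribution
gate = 1.0 / (1.0 + np.exp(-(hidden_states @ W_gate)))  # shape (tokens, 1)
output = routed_out + gate * shared_out

assert gate.shape == (tokens, 1)
assert np.all((gate > 0.0) & (gate < 1.0))
```

This framing is why the questions above are natural: a sigmoid gate near 1 everywhere would mean the shared expert is always fully mixed in, while task-dependent gate values would indicate the model learned when to rely on it.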
Sorry, we are not releasing these details for now. Stay tuned for the upcoming tech report.