DeepSeekDDM comments

Results: 4 comments by DeepSeekDDM

`V-MoE` token droping and `MoD`

Our token-dropping strategy is simply token-wise dropping based on the routing probability. It is closer to the token dropping used in conventional MoE models such as Switch Transformer. It is totally different from...

`V-MoE` token droping and `MoD`

> @DeepSeekDDM @luofuli Is the capacity in the token-dropping strategy based on the expert dimension or the device dimension? If it's on the expert dimension, then the capacity is calculated...

`V-MoE` token droping and `MoD`

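> @DeepSeekDDM Just to confirm: DeepSeek-V2 implements token dropping at the device level. For device-level dropping, are the scores of all experts on the current device sorted together, with tokens then dropped from that combined ranking?

Yes. The actual dropping strategy is a little complex, but the main idea is...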
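To make the device-level idea concrete, here is a minimal sketch in plain Python. It only illustrates the main idea confirmed above (rank every token-to-expert assignment on a device together by routing score, then drop the lowest ones beyond a budget). It is not the actual DeepSeek-V2 implementation, whose details are not given in this thread. The function name `device_level_drop`, the `expert_to_device` mapping, and the `device_capacity` parameter are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


def device_level_drop(assignments, expert_to_device, device_capacity):
    """Illustrative device-level token dropping (hypothetical helper).

    assignments: list of (token_id, expert_id, score) routing decisions.
    expert_to_device: dict mapping expert_id -> device_id (assumed layout).
    device_capacity: max number of assignments each device may process.
    Returns (kept, dropped) lists of assignments.
    """
    # Group assignments by the device that hosts the chosen expert.
    per_device = defaultdict(list)
    for token_id, expert_id, score in assignments:
        device = expert_to_device[expert_id]
        per_device[device].append((token_id, expert_id, score))

    kept, dropped = [], []
    for device in sorted(per_device):
        # Rank all assignments on this device together, regardless of
        # which expert they target, highest routing score first.
        ranked = sorted(per_device[device], key=lambda a: a[2], reverse=True)
        kept.extend(ranked[:device_capacity])
        dropped.extend(ranked[device_capacity:])
    return kept, dropped


# Toy example: 4 tokens, top-2 routing, 4 experts spread over 2 devices.
assignments = [
    (0, 0, 0.9), (0, 2, 0.1),
    (1, 1, 0.6), (1, 0, 0.4),
    (2, 1, 0.7), (2, 3, 0.3),
    (3, 0, 0.5), (3, 1, 0.5),
]
expert_to_device = {0: 0, 1: 0, 2: 1, 3: 1}
kept, dropped = device_level_drop(assignments, expert_to_device, device_capacity=4)
```

In this toy run, device 0 receives six assignments but has a budget of four, so its two lowest-scoring assignments are dropped, while device 1 stays under budget and keeps everything. With an expert-level budget the ranking would instead be done separately for each expert.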

`V-MoE` token droping and `MoD`

> @DeepSeekDDM Could you roughly describe the "actual dropping strategy"? I'm quite curious.

Just some additional tricks to ensure computational efficiency. They are not the key technique of DeepSeekMoE, and the missing details will not prevent you from reproducing DeepSeekMoE.