DeepSeekDDM
Our token-dropping strategy is simply token-wise dropping with respect to the routing probability. It is closer to the token dropping used in conventional MoE models such as Switch Transformer. It is totally different from...
> @DeepSeekDDM @luofuli Is the capacity in the token-dropping strategy based on the expert dimension or the device dimension? If it's on the expert dimension, then the capacity is calculated...
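> @DeepSeekDDM To confirm: DeepSeek-V2 implements token dropping along the device dimension. For device-dimension dropping, are the scores of all experts on the current device sorted together, with tokens then dropped based on that ranking?

Yes. The actual dropping strategy is a little complex, but the main idea is...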
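
For readers following along, the device-level idea confirmed above can be sketched roughly as follows. This is only an illustration of "pool every expert's scores on one device, rank them, and drop the lowest", not the actual DeepSeek-V2 implementation, which the reply describes as more complex. The function name `device_level_drop`, the `(token_id, expert_id, routing_prob)` tuple layout, and the `capacity` parameter are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical layout: (token_id, expert_id, routing_prob) for every
# token-to-expert assignment that landed on ONE device.
Assignment = Tuple[int, int, float]


def device_level_drop(
    assignments: List[Assignment], capacity: int
) -> Tuple[List[Assignment], List[Assignment]]:
    """Keep the `capacity` highest-probability assignments on a device.

    All experts on the device share a single ranking, so the budget is
    per device rather than per expert (an assumption of this sketch).
    """
    if capacity < 0:
        raise ValueError("capacity must be non-negative")
    # One ranking across every expert on the device, highest score first.
    ranked = sorted(assignments, key=lambda a: a[2], reverse=True)
    return ranked[:capacity], ranked[capacity:]


# Two experts (0 and 1) on the same device, five assignments, budget of 3.
assignments = [
    (0, 0, 0.9),
    (1, 0, 0.2),
    (2, 1, 0.6),
    (3, 1, 0.1),
    (4, 0, 0.5),
]
kept, dropped = device_level_drop(assignments, capacity=3)
print("kept tokens:   ", [token for token, _, _ in kept])
print("dropped tokens:", [token for token, _, _ in dropped])
```

Note that expert 0 keeps two assignments while expert 1 keeps only one: with a device-level budget, the split between experts is decided by the scores rather than by a fixed per-expert capacity.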
> @DeepSeekDDM Could you roughly explain the "actual dropping strategy"? I'm rather curious.

Just some additional tricks to ensure computation efficiency. It is not the key technique of DeepSeekMoE. The details will not prevent you from reproducing DeepSeekMoE.