DeepSeek-V2

`V-MoE` token dropping and `MoD`

Open · liyucheng09 opened this issue 9 months ago · 8 comments

This token dropping method, as indicated by the citation, is based on the V-MoE method.

How is this different from the recent MoD? They look like very similar techniques.

liyucheng09 commented on May 7, 2024

Our token-dropping strategy is simply token-wise dropping w.r.t. the routing probability. It is more like the token dropping in conventional MoE models such as Switch Transformer. It is totally different from MoD, so I do not quite understand your question. Can you give me more information about your understanding of our token-dropping strategy and of MoD? Maybe we can find out where the misunderstanding is.

DeepSeekDDM commented on May 14, 2024
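For context on the Switch-Transformer-style dropping mentioned above, here is a minimal sketch of per-expert, capacity-based token dropping. It is an illustration under assumptions (top-1 routing, the hypothetical helper `drop_by_expert_capacity`), not DeepSeek's actual code.

```python
import torch

def drop_by_expert_capacity(router_probs: torch.Tensor, capacity: int) -> torch.Tensor:
    """router_probs: [num_tokens, num_experts] softmax routing probabilities.

    Returns a boolean mask over tokens; False marks a dropped token, whose
    expert computation is skipped (only the residual stream passes through).
    """
    top_prob, top_expert = router_probs.max(dim=-1)  # top-1 routing for simplicity
    keep = torch.zeros_like(top_prob, dtype=torch.bool)
    for e in range(router_probs.size(-1)):
        idx = (top_expert == e).nonzero(as_tuple=True)[0]  # tokens routed to expert e
        if idx.numel() > capacity:
            # keep only the `capacity` tokens with the highest routing probability
            idx = idx[top_prob[idx].argsort(descending=True)[:capacity]]
        keep[idx] = True
    return keep
```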

@DeepSeekDDM @luofuli Is the capacity in the token-dropping strategy computed along the expert dimension or the device dimension? If it is the expert dimension, then the capacity would be `capacity = math.ceil(num_tokens * topk / num_experts * capacity_factor)`, and each expert would process its own tokens, dropping the lowest-scored ones when tokens > capacity and padding when tokens < capacity. If it is the device dimension, is the capacity calculated as `capacity = math.ceil(num_tokens * topk / num_groups * capacity_factor)`, and how is the token dropping executed in that case? I ask because the paper mentions device-level token dropping, which is the source of my confusion.

Richie-yan commented on May 29, 2024
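For reference, the two capacity conventions contrasted in the question above can be written out as follows. The numbers are placeholders (not taken from the paper), and the `ceil` is placed over the whole expression, which is the usual convention.

```python
import math

# Placeholder values for illustration; none of these come from the paper.
num_tokens, topk, capacity_factor = 4096, 6, 1.0
num_experts, num_groups = 64, 8

# Expert-dimension reading: a per-expert budget (Switch-Transformer style).
expert_capacity = math.ceil(num_tokens * topk / num_experts * capacity_factor)

# Device-dimension reading: a shared budget per device (expert group).
device_capacity = math.ceil(num_tokens * topk / num_groups * capacity_factor)

print(expert_capacity, device_capacity)  # 384 3072
```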

Adding another question: How should I understand the statement from the paper that "we ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped"? Is there a specific strategy implemented during token dropping to enforce this? @DeepSeekDDM @luofuli

Richie-yan commented on May 29, 2024

A to Q1: Mainly on the device dimension.
A to Q2: Yes.
A to Q3: We also drop the tokens with the lowest probability.
A to Q4 & Q5: Yes, we implement a specific strategy to ensure this.

DeepSeekDDM commented on May 30, 2024

@DeepSeekDDM Just to confirm: DeepSeek-V2 implements token dropping at the device level. When dropping at the device level, are the routing scores of all experts on the current device sorted together, and the lowest-scored tokens then dropped?

Richie-yan commented on May 30, 2024

Yes. The actual dropping strategy is a little complex, but the main idea is what you described just now.

DeepSeekDDM commented on May 30, 2024
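Putting the answers above together, a minimal sketch of device-level dropping could look like the following. The function name, tensor layout, and the way `device_capacity` is applied are assumptions for illustration, not the actual DeepSeek-V2 implementation.

```python
import torch

def device_level_drop(scores: torch.Tensor, expert_ids: torch.Tensor,
                      experts_on_device: torch.Tensor, device_capacity: int) -> torch.Tensor:
    """scores:            [num_assignments] routing probability of each (token, expert) pair
    expert_ids:           [num_assignments] expert index of each assignment
    experts_on_device:    1-D tensor of expert indices hosted on this device

    Returns a boolean mask marking the assignments this device keeps; assignments
    routed to other devices' experts are False here and handled by those devices.
    """
    on_device = torch.isin(expert_ids, experts_on_device)
    keep = on_device.clone()
    idx = on_device.nonzero(as_tuple=True)[0]
    if idx.numel() > device_capacity:
        # One sort over the scores of all experts on this device together,
        # then drop the lowest-scored assignments beyond the device capacity.
        order = scores[idx].argsort(descending=True)
        keep[idx[order[device_capacity:]]] = False
    return keep
```

The difference from per-expert dropping is that tokens routed to all experts on the device compete for one shared budget, so an overloaded expert can in effect use capacity left over by a less-loaded expert on the same device.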

@DeepSeekDDM Could you briefly describe the actual dropping strategy? I am curious about it.

Richie-yan commented on May 30, 2024

They are just some additional tricks to ensure computational efficiency. They are not the key technique of DeepSeekMoE, and not knowing these details will not prevent you from reproducing DeepSeekMoE.

DeepSeekDDM commented on May 31, 2024