MoEfication
For SMoE, you set the number of experts N to 3 and the number of selected experts K to 2. Why does this ensure the computational cost is similar to that of the other methods?

In the paper:
``For the MoE layers, we set the number of experts N to 32 for MoE-Dropout and SSD. MoE-Dropout linearly increases the number of selected experts K from 6 to 32 during the pre-training. For SSD, we set the threshold τ to 0.9 and monitor the activation pattern every 3,000 steps. In the sparse mode, we also select 6 experts for each layer. The ratio of the sparse mode r is set to 0.5. The ratio of the final dense training l is set to 0.1. For SMoE, we set the number of experts N to 3 and the number of selected experts K to 2 to ensure the computational cost is similar to that of other methods.''
Why does SSD select 6 experts out of 32, while SMoE selects only 2 out of 3?
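To make the question concrete, here is a back-of-the-envelope sketch of the per-token FFN compute implied by each setting. It assumes the MoEfication-style convention that a dense FFN of inner width d_ff is split evenly into N experts of width d_ff / N, so routing a token to K experts costs roughly a fraction K / N of the dense FFN FLOPs. This even-split assumption is mine, not stated in the quoted passage; whether SMoE instead uses differently sized experts is exactly what the question is about.

```python
def ffn_flop_fraction(num_experts: int, selected: int) -> float:
    """Per-token FFN compute relative to the dense model, assuming the
    dense FFN is split evenly into `num_experts` equal-width experts."""
    return selected / num_experts

# (N, K) pairs taken from the quoted experimental setup.
configs = {
    "MoE-Dropout (start)": (32, 6),   # K grows linearly from 6 to 32
    "MoE-Dropout (end)":   (32, 32),
    "SSD (sparse mode)":   (32, 6),
    "SMoE":                (3, 2),
}

for name, (n, k) in configs.items():
    print(f"{name:>20}: K/N = {k}/{n} = {ffn_flop_fraction(n, k):.3f} x dense FFN")
```

Under this even-split assumption the fractions differ substantially (6/32 = 0.1875 for SSD's sparse mode vs. 2/3 ≈ 0.667 for SMoE), so the claimed "similar computational cost" for SMoE presumably relies on a different expert sizing or accounting, which the quoted passage does not spell out.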