Inquiry About K-Means Initialization for Gates Without Fine-Tuning
Thanks for your excellent work! I came across your paper and noticed that the gates are initialized using K-Means, which seems quite innovative. However, the paper does not mention the performance when using this method directly.
I am curious: if we take the parameters obtained directly from K-Means initialization and evaluate the model without any fine-tuning, how is the PPL (perplexity) affected? Could you please share some insights on this?
Thanks again for your time.
@DaizeDong maybe you can help
Thank you for your attention to our project! That is a very good question!
Unfortunately, according to our observation, the converted model without further fine-tuning is essentially "broken", i.e., its PPL is very high. I conjecture this is due to the poor activation sparsity of modern LLMs, which are usually over-trained on a huge amount of tokens. So directly using the model obtained from K-Means initialization may not be very effective. However, for models like BERT (which is smaller and uses ReLU as the activation), this may be worth a try.
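For context, here is a minimal sketch of what K-Means-based gate initialization can look like (the function and variable names are illustrative, not the exact code in our repo): collect hidden states from a calibration corpus with the dense model, run K-Means, and use the cluster centroids as the rows of the gate's weight matrix.

```python
import torch
from sklearn.cluster import KMeans

def kmeans_gate_init(hidden_states: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Hypothetical helper: build a gate weight matrix from K-Means centroids.

    hidden_states: (num_tokens, hidden_size) features collected from a
                   calibration corpus with the original dense model.
    Returns a (num_experts, hidden_size) tensor usable as an nn.Linear weight.
    """
    feats = hidden_states.float().cpu().numpy()
    kmeans = KMeans(n_clusters=num_experts, n_init=10, random_state=0).fit(feats)
    centroids = torch.from_numpy(kmeans.cluster_centers_)  # (num_experts, hidden_size)
    # Each centroid acts as the routing direction for one expert, so a token is
    # initially sent to the expert whose centroid it is closest to (up to the
    # difference between dot-product routing and Euclidean clustering).
    return centroids.to(hidden_states.dtype)

# Example usage (shapes only):
# gate_weight = kmeans_gate_init(sampled_hidden_states, num_experts=8)
# moe_layer.gate.weight.data.copy_(gate_weight)
```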
@DaizeDong Thanks for your swift reply.
I wonder whether this K-Means initialization can make the gate converge faster than random initialization?
Thanks!
@pprp Sorry, we didn't conduct experiments ablating the initialization method of the gate weights. However, this method leads to better load balance at initialization, and I believe it can also accelerate convergence. If you are interested, I can run some experiments on it, and maybe we can collaborate on a deeper investigation of gate initialization strategies.
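If it helps, the initial load balance of the two initializations could be compared with something like the rough sketch below (all names are illustrative, not from the repo): route a batch of calibration tokens through each gate and look at the per-expert token fractions.

```python
import torch

def expert_load(gate_weight: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
    """Fraction of tokens routed (top-1) to each expert under a given gate weight.

    gate_weight:   (num_experts, hidden_size)
    hidden_states: (num_tokens, hidden_size) calibration features.
    """
    logits = hidden_states @ gate_weight.t()       # (num_tokens, num_experts)
    top1 = logits.argmax(dim=-1)                   # winning expert per token
    counts = torch.bincount(top1, minlength=gate_weight.size(0)).float()
    return counts / counts.sum()

# A perfectly balanced gate gives ~1/num_experts for every expert; a skewed
# distribution at step 0 usually means some experts start out "starved".
# load_kmeans = expert_load(kmeans_gate_weight, sampled_hidden_states)
# load_random = expert_load(torch.randn_like(kmeans_gate_weight) * 0.02, sampled_hidden_states)
```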
@DaizeDong I conducted experiments on it, and here are the results:
When employing the K-Means initialization, we got the following results:
And here are the results with random initialization:
The only difference between the two runs is whether we pass --gate_weights_file.
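For reference, the file I pass to --gate_weights_file is just a saved tensor dictionary along the lines of the sketch below; the exact format your conversion script expects may differ, so treat this as my own setup rather than the repo's convention.

```python
import torch

# Illustrative only: one (num_experts, hidden_size) gate weight per MoE layer,
# keyed by layer index. Replace the random tensors with K-Means centroids.
num_moe_layers, num_experts, hidden_size = 32, 8, 4096
gate_weights = {
    layer_idx: torch.randn(num_experts, hidden_size)
    for layer_idx in range(num_moe_layers)
}
torch.save(gate_weights, "gate_weights.pt")  # path passed via --gate_weights_file
```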
@pprp Your images show that both models suffer a great performance loss right after initialization, which aligns with our observation. I think you need to train the models on more tokens to compare the convergence rates of the two methods.