Inquiry About K-Means Initialization for Gates Without Fine-Tuning
Thanks for your excellent work! I came across your paper and noticed that the gates are initialized using K-Means, which seems quite innovative. However, the paper does not mention the performance when using this method directly.
I am curious: if we take the parameters obtained directly from K-Means initialization and evaluate the model without any fine-tuning, how is the PPL (perplexity) affected? Could you please share some insights on this?
Thanks again for your time.
@DaizeDong maybe you can help
Thank you for your attention to our project! That is a very good question!
Unfortunately, according to our observation, the converted model without further fine-tuning is essentially "broken", i.e., its PPL is very high. I conjecture this is due to the poor activation sparsity of modern LLMs, which are usually over-trained on a huge amount of tokens. So directly using the model obtained from K-Means initialization may not be very effective. However, for models like BERT (which is smaller and uses ReLU as the activation), this may be worth a try.
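For context, here is a minimal sketch of what K-Means-based gate initialization can look like (the function and variable names are illustrative, not the exact code in our repo): collect hidden states from a calibration corpus with the dense model, run K-Means, and use the cluster centroids as the rows of the gate's weight matrix.

```python
import torch
from sklearn.cluster import KMeans

def kmeans_gate_init(hidden_states: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Hypothetical helper: build a gate weight matrix from K-Means centroids.

    hidden_states: (num_tokens, hidden_size) features collected from a
                   calibration corpus with the original dense model.
    Returns a (num_experts, hidden_size) tensor usable as an nn.Linear weight.
    """
    feats = hidden_states.float().cpu().numpy()
    kmeans = KMeans(n_clusters=num_experts, n_init=10, random_state=0).fit(feats)
    centroids = torch.from_numpy(kmeans.cluster_centers_)  # (num_experts, hidden_size)
    # Each centroid acts as the routing direction for one expert, so a token is
    # initially sent to the expert whose centroid it is closest to (up to the
    # difference between dot-product routing and Euclidean clustering).
    return centroids.to(hidden_states.dtype)

# Example usage (shapes only):
# gate_weight = kmeans_gate_init(sampled_hidden_states, num_experts=8)
# moe_layer.gate.weight.data.copy_(gate_weight)
```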
@DaizeDong Thanks for your swift reply.
I wonder whether this K-Means initialization can make the gate converge faster than random initialization?
Thanks!
@pprp Sorry, we didn't conduct experiments ablating the initialization method of the gate weights. However, this method leads to better load balance at initialization, and I believe it can also accelerate convergence. If you are interested, I can run some experiments on it, and maybe we can collaborate on a deeper investigation of gate initialization strategies.
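If it helps, the initial load balance of the two initializations could be compared with something like the rough sketch below (all names are illustrative, not from the repo): route a batch of calibration tokens through each gate and look at the per-expert token fractions.

```python
import torch

def expert_load(gate_weight: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
    """Fraction of tokens routed (top-1) to each expert under a given gate weight.

    gate_weight:   (num_experts, hidden_size)
    hidden_states: (num_tokens, hidden_size) calibration features.
    """
    logits = hidden_states @ gate_weight.t()       # (num_tokens, num_experts)
    top1 = logits.argmax(dim=-1)                   # winning expert per token
    counts = torch.bincount(top1, minlength=gate_weight.size(0)).float()
    return counts / counts.sum()

# A perfectly balanced gate gives ~1/num_experts for every expert; a skewed
# distribution at step 0 usually means some experts start out "starved".
# load_kmeans = expert_load(kmeans_gate_weight, sampled_hidden_states)
# load_random = expert_load(torch.randn_like(kmeans_gate_weight) * 0.02, sampled_hidden_states)
```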
@DaizeDong I conducted experiments on it, and here are the results:
When employing the K-Means initialization, we got the following results:
And here are the results with random initialization:
The only difference between the two runs is whether we pass --gate_weights_file.
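For reference, the file I pass to --gate_weights_file is just a saved tensor dictionary along the lines of the sketch below; the exact format your conversion script expects may differ, so treat this as my own setup rather than the repo's convention.

```python
import torch

# Illustrative only: one (num_experts, hidden_size) gate weight per MoE layer,
# keyed by layer index. Replace the random tensors with K-Means centroids.
num_moe_layers, num_experts, hidden_size = 32, 8, 4096
gate_weights = {
    layer_idx: torch.randn(num_experts, hidden_size)
    for layer_idx in range(num_moe_layers)
}
torch.save(gate_weights, "gate_weights.pt")  # path passed via --gate_weights_file
```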
@pprp Your images show that both models suffer a great performance loss right after initialization, which aligns with our observation. I think you need to train the models on more tokens to compare the convergence rates of the two methods.