LLaMA-MoE-v2
🚀 LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training
Hello authors. I tried to train LLaMA-MLP-MoE (2/8). After two stages of training, the model cannot output normal sentences. The inference script is as follows:

```python
model_dir = ""
tokenizer...
```
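For reference, a minimal inference sketch along these lines, assuming the checkpoint loads through `transformers` with custom model code; the `model_dir` path, dtype, and generation settings are placeholders, not the reporter's actual script:

```python
# Minimal inference sketch (assumed setup, not the reporter's exact script).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/path/to/llama-moe-v2-checkpoint"  # hypothetical path

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()

prompt = "Give me a short introduction to large language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```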
Thanks for your excellent work! I came across your paper and noticed that the gates are initialized using K-Means, which seems quite innovative. However, the paper does not mention the...
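The paper is the authoritative reference here, but one plausible reading of "K-Means gate initialization" is to cluster a sample of token hidden states into one group per expert and use the centroids as the router's weight rows. A rough sketch of that idea, with all shapes, names, and the calibration data assumed for illustration:

```python
# Illustrative guess at K-Means-based gate (router) initialization:
# cluster hidden states into num_experts groups and copy the centroids
# into the router weights. Not the authors' actual implementation.
import torch
from sklearn.cluster import KMeans

hidden_size, num_experts = 4096, 8

# `hidden_states`: [num_tokens, hidden_size] activations collected from a
# small calibration set (random here, assumed to be real data in practice).
hidden_states = torch.randn(10_000, hidden_size)

kmeans = KMeans(n_clusters=num_experts, n_init=10, random_state=0)
kmeans.fit(hidden_states.float().numpy())

# The router is a single linear layer mapping hidden states to expert logits.
gate = torch.nn.Linear(hidden_size, num_experts, bias=False)
with torch.no_grad():
    centroids = torch.tensor(kmeans.cluster_centers_, dtype=gate.weight.dtype)
    gate.weight.copy_(centroids)  # one centroid per expert
```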
## What's New

Add [megablocks](https://github.com/databricks/megablocks) support for MLP MoE. The dumping & reloading test passes, verified by observing a continuous loss decline, but further downstream metrics have not been tested. Please use...
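For context, a dump-and-reload check of that kind can be sketched generically as below; the toy model, optimizer, and thresholds are placeholders, and nothing here is megablocks-specific:

```python
# Generic dump & reload sanity check: train a few steps, save, rebuild and
# reload, then confirm the loss keeps declining after the reload.
import torch

def make_model():
    return torch.nn.Linear(16, 1)

def train_steps(model, opt, n_steps):
    losses = []
    for _ in range(n_steps):
        x = torch.randn(64, 16)
        y = x.sum(dim=1, keepdim=True)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

model = make_model()
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)
before = train_steps(model, opt, 200)

# Dump model and optimizer state, then reload into fresh objects.
torch.save({"model": model.state_dict(), "opt": opt.state_dict()}, "ckpt.pt")
ckpt = torch.load("ckpt.pt")
model2 = make_model()
model2.load_state_dict(ckpt["model"])
opt2 = torch.optim.AdamW(model2.parameters(), lr=1e-2)
opt2.load_state_dict(ckpt["opt"])

after = train_steps(model2, opt2, 200)
# After reloading, the loss should continue from (and stay below) the
# pre-dump level rather than jumping back up.
assert sum(after[-20:]) / 20 <= sum(before[-20:]) / 20
print("reload check passed: loss continued to decline")
```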