Tong Zhu (朱桐)

45 comments by Tong Zhu (朱桐)

> @Spico197 It looks like you're running the given code, so this may not be as relevant, but I wanted to share that I was able to reduce loss...

@Aatlantise Thank you very much for sharing! I'm planning to rewrite the model from scratch and see whether there's a performance difference. I'll post an update in this thread if there's...

Hi there, thanks for your attention to this project ❤️ 1. LLaMA-MoE is constructed by splitting ONE complete llama2-7B model. 2. Yes, you are right: we only split llama's FFN layers and then add a gate for token routing. 3. That is not supported at the moment, but since the idea assumes homogeneous models, I think it would be fairly easy to implement.
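
For readers skimming this thread, below is a minimal sketch of what "split the FFN into experts and add a gate for token routing" looks like in code. The class, argument names, and routing details here are hypothetical; the actual LLaMA-MoE implementation lives in the project repository, and in practice the expert weights are copied from slices of the dense llama2-7B FFN rather than initialized from scratch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitMoEFFN(nn.Module):
    """Sketch: slice one dense SwiGLU FFN into `num_experts` experts along
    the intermediate dimension and route tokens with a learned top-k gate."""

    def __init__(self, hidden_size, intermediate_size, num_experts=8, top_k=2):
        super().__init__()
        assert intermediate_size % num_experts == 0
        slice_size = intermediate_size // num_experts
        self.top_k = top_k
        # the gate is newly added; everything else comes from the dense model
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.ModuleDict({
                "gate_proj": nn.Linear(hidden_size, slice_size, bias=False),
                "up_proj": nn.Linear(hidden_size, slice_size, bias=False),
                "down_proj": nn.Linear(slice_size, hidden_size, bias=False),
            })
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (num_tokens, hidden_size)
        scores = F.softmax(self.gate(x), dim=-1)
        weights, selected = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = selected[:, k] == e     # tokens routed to expert e
                if mask.any():
                    h = expert["down_proj"](
                        F.silu(expert["gate_proj"](x[mask])) * expert["up_proj"](x[mask])
                    )
                    out[mask] += weights[:, k][mask].unsqueeze(-1) * h
        return out
```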

We tested first freezing the other parameters and pre-training only the gates (see the sketch below). However, as more tokens were consumed during continual pre-training, the two-stage pre-training didn't show any advantage, so we keep the...
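
A minimal sketch of that first stage, assuming the router/gate parameters can be identified by a substring of their names; the `gate_keyword` default is an assumption about the naming scheme, not the project's actual parameter names.

```python
import torch

def freeze_all_but_gates(model, gate_keyword=".gate."):
    """Freeze every parameter whose name does not contain `gate_keyword`.
    Adjust the keyword to whatever the gate/router modules are called
    in your model."""
    for name, param in model.named_parameters():
        param.requires_grad = gate_keyword in name
    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    print(f"{len(trainable)} trainable tensors, e.g. {trainable[:3]}")
    return [p for p in model.parameters() if p.requires_grad]

# the optimizer then only updates the gate parameters during stage one
# optimizer = torch.optim.AdamW(freeze_all_but_gates(model), lr=1e-4)
```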

Hi there~ For the multi-stage pre-training comparison, it takes about 20B tokens. It may take about 20~30B tokens to reach a relatively low loss value (~2.1). But 20B tokens for gate...

Hi there, thanks for the question~ It is possible to use these frameworks for LLaMA-MoE inference acceleration, and we are working on it. It may take some time to develop...

Hi there, are you planning to re-train the model with the filtered dataset? Could you release a smaller model's weights for us to play with? It's really frustrating to wait for such...

Hi there, sorry for the late response. Thank you very much for your attention to our project ❤️ 1. For LLaMA-MoE-3.5B (2/8), it takes about one week to reproduce the...

Hi there, thanks for using the released code and model weights~ It seems that `model.layers.0.mlp.calculator.experts.parametrizations.weight.original0ate.0` is not a valid parameter name in the model structure. Could you please provide the...
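
When debugging a key mismatch like this, one way is to diff the checkpoint keys against the names the instantiated model actually exposes. The sketch below assumes a standard `transformers` load; the model id and checkpoint path are placeholders, not the project's actual paths.

```python
# Hypothetical debugging sketch: spot unexpected or missing parameter names.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-moe-checkpoint", trust_remote_code=True
)
model_keys = set(model.state_dict().keys())

ckpt = torch.load("path/to/pytorch_model.bin", map_location="cpu")
ckpt_keys = set(ckpt.keys())

print("unexpected keys:", sorted(ckpt_keys - model_keys)[:5])
print("missing keys:", sorted(model_keys - ckpt_keys)[:5])
```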

> I encountered the same problem (crying).

Thanks for your info. No ideas yet; maybe there is a bug somewhere. I'll set up a new environment to reproduce the problem this...