Tong Zhu (朱桐)

45 comments by Tong Zhu (朱桐)

> @Spico197 It looks like you're running the given code, so this may not be as relevant, but I wanted to share that I was able to reduce loss...

@Aatlantise Thank you very much for sharing! I'm planning to rewrite the model from scratch and see whether there's a performance difference. I'll post an update in this thread if there's...

Hi there, thanks for your attention to this project ❤️ 1. LLaMA-MoE is constructed by splitting ONE complete llama2-7B model. 2. Yes, you are right: we only split llama's FFN layers and then add a gate for token routing. 3. That is not supported at the moment, but since the idea assumes homogeneous models, I think it would be fairly easy to implement.
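
For readers skimming this thread, below is a minimal sketch of what "split the FFN into experts and add a gate for token routing" looks like in code. The class, argument names, and routing details here are hypothetical; the actual LLaMA-MoE implementation lives in the project repository, and in practice the expert weights are copied from slices of the dense llama2-7B FFN rather than initialized from scratch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitMoEFFN(nn.Module):
    """Sketch: slice one dense SwiGLU FFN into `num_experts` experts along
    the intermediate dimension and route tokens with a learned top-k gate."""

    def __init__(self, hidden_size, intermediate_size, num_experts=8, top_k=2):
        super().__init__()
        assert intermediate_size % num_experts == 0
        slice_size = intermediate_size // num_experts
        self.top_k = top_k
        # the gate is newly added; everything else comes from the dense model
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.ModuleDict({
                "gate_proj": nn.Linear(hidden_size, slice_size, bias=False),
                "up_proj": nn.Linear(hidden_size, slice_size, bias=False),
                "down_proj": nn.Linear(slice_size, hidden_size, bias=False),
            })
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (num_tokens, hidden_size)
        scores = F.softmax(self.gate(x), dim=-1)
        weights, selected = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = selected[:, k] == e     # tokens routed to expert e
                if mask.any():
                    h = expert["down_proj"](
                        F.silu(expert["gate_proj"](x[mask])) * expert["up_proj"](x[mask])
                    )
                    out[mask] += weights[:, k][mask].unsqueeze(-1) * h
        return out
```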

We tested first freezing the other parameters and pre-training only the gates (see the sketch below). However, as more tokens were consumed during continual pre-training, the two-stage pre-training didn't show any advantage, so we keep the...
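
A minimal sketch of that first stage, assuming the router/gate parameters can be identified by a substring of their names; the `gate_keyword` default is an assumption about the naming scheme, not the project's actual parameter names.

```python
import torch

def freeze_all_but_gates(model, gate_keyword=".gate."):
    """Freeze every parameter whose name does not contain `gate_keyword`.
    Adjust the keyword to whatever the gate/router modules are called
    in your model."""
    for name, param in model.named_parameters():
        param.requires_grad = gate_keyword in name
    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    print(f"{len(trainable)} trainable tensors, e.g. {trainable[:3]}")
    return [p for p in model.parameters() if p.requires_grad]

# the optimizer then only updates the gate parameters during stage one
# optimizer = torch.optim.AdamW(freeze_all_but_gates(model), lr=1e-4)
```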

Hi there~ For the multi-stage pre-training comparison, it takes about 20B tokens. It may take about 20~30B tokens to reach a relatively low loss value (~2.1). But 20B tokens for gate...

Hi there, thanks for the question~ It is possible to use these frameworks for LLaMA-MoE inference acceleration, and we are working on it. It may take some time to develop...

Hi there, are you planning to re-train the model with the filtered dataset? Could you release a smaller model's weights for us to play with? It's really frustrating to wait for such...

Hi there, sorry for the late response. Thank you very much for your attention to our project ❤️ 1. For LLaMA-MoE-3.5B (2/8), it takes about one week to reproduce the...

Hi there, thanks for using the released code and model weights~ It seems that `model.layers.0.mlp.calculator.experts.parametrizations.weight.original0ate.0` is not a valid parameter name in the model structure. Could you please provide the...
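
When debugging a key mismatch like this, one way is to diff the checkpoint keys against the names the instantiated model actually exposes. The sketch below assumes a standard `transformers` load; the model id and checkpoint path are placeholders, not the project's actual paths.

```python
# Hypothetical debugging sketch: spot unexpected or missing parameter names.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-moe-checkpoint", trust_remote_code=True
)
model_keys = set(model.state_dict().keys())

ckpt = torch.load("path/to/pytorch_model.bin", map_location="cpu")
ckpt_keys = set(ckpt.keys())

print("unexpected keys:", sorted(ckpt_keys - model_keys)[:5])
print("missing keys:", sorted(model_keys - ckpt_keys)[:5])
```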

> I encountered the same problem (crying).

Thanks for your info. No ideas yet; maybe there is a bug somewhere. I'll set up a new environment to reproduce the problem this...