Tong Zhu (朱桐)
For toolkit usage problems, such as errors raised while using the tool, you must strictly follow the `Toolkit usage` issue template when opening a new issue. Otherwise, your issue may be closed directly...
From [u/biadelatrixyaska @ reddit](https://www.reddit.com/r/MachineLearning/comments/rkewa3/d_what_are_your_machine_learning_superstitions/?utm_source=share&utm_medium=web2x&context=3), 42 is a good choice for a default setting. Maybe there are more *best default random seeds*, and we should add these seeds as a default...
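A minimal sketch of what a default seed of 42 could look like in practice, using only Python's standard library. The names `DEFAULT_SEED` and `sample_numbers` are hypothetical and do not come from any project mentioned here.

```python
import random

# Hypothetical default, following the suggestion that 42 is a good choice.
DEFAULT_SEED = 42


def sample_numbers(n, seed=DEFAULT_SEED):
    """Return n pseudo-random floats drawn from a generator seeded with `seed`."""
    # A dedicated Random instance avoids touching the global random state.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


if __name__ == "__main__":
    # Two calls with the default seed produce identical sequences.
    print(sample_numbers(3) == sample_numbers(3))
```

Passing `seed` explicitly keeps the default overridable, so a run can still be repeated with a different seed when checking how sensitive a result is to initialization.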
Here, the tensors in `cheap_embed` are 4-dimensional: https://github.com/cg123/mergekit/blob/d55f654c2e70d3ac4ad6532de96e266aff2de931/mergekit/scripts/mixtral_moe.py#L87 However, `gate_vec` receives a 3-dimensional tensor: https://github.com/cg123/mergekit/blob/d55f654c2e70d3ac4ad6532de96e266aff2de931/mergekit/scripts/mixtral_moe.py#L158-L161
**Problems** The loss weight described in the paper may not match the one used in the code. Further checks and explanations TBD.
The link to _A Review of Sparse Expert Models in Deep Learning_ should be `https://arxiv.org/abs/2209.01667`
## 🐛 Bug Report The loss does not decrease or converge, so a valid reproduction result cannot be obtained. ## 🔬 How To Reproduce Steps to reproduce the behavior: 1....
### Content My request is ... ### Code of Conduct - [X] I agree to follow this project's Code of Conduct
### Content My request is as described in the title. ### Code of Conduct - [X] I agree to follow this project's Code of Conduct