Results 7 issues of ldwang

Can this be run in cpu-ps mode?

## ❓ Questions and Help ### Before asking: 1. search the issues. 2. search the docs. #### What is your question? I used the metaseq-train script. When training finished, the checkpoint shards are...

question

Running only the command `mergekit-moe xxx.yml output` produces the errors below. We just intend to merge the models into an MoE. Thanks for your advice. ` base_model: A gate_mode: hidden dtype: bfloat16 experts:...

Could you please give me advice on mu parametrization for gated MLPs and grouped-query attention? Thanks very much. @thegregyang @edwardjhu

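https://github.com/OpenBMB/ModelCenter/blob/main/examples/cpm2/pretrain_cpm2.py#L24 Is the model initialization here executed on every GPU? If the model is very large, it may run out of memory (OOM). Thank you for your answer.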

While continuing to train MoE models (loading an existing checkpoint), assertion errors occurred at some steps, as follows: "found NaN in local grad norm in backward pass before data-parallel communication collective". https://github.com/NVIDIA/Megatron-LM/blob/caf2007e080d65dd7488be7bd409b366e225ab5f/megatron/core/distributed/param_and_grad_buffer.py#L115 ##...
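
For readers unfamiliar with this kind of check, here is a minimal sketch of what an assertion like the one quoted above guards against. It is not Megatron-LM's code: it uses only the Python standard library, represents gradients as plain lists of floats, and the names `local_grad_norm` and `check_local_grads` are hypothetical. The idea is that each rank computes the norm of its own gradients and fails fast if it is NaN, before the values are mixed with other ranks' gradients.

```python
import math


def local_grad_norm(grads):
    """L2 norm over all local gradient values.

    `grads` is a list of flat lists of floats, one per parameter
    (an assumption made to keep this sketch dependency-free).
    """
    return math.sqrt(sum(g * g for grad in grads for g in grad))


def check_local_grads(grads):
    """Raise before any (hypothetical) cross-rank reduction if the norm is NaN."""
    norm = local_grad_norm(grads)
    if math.isnan(norm):
        # A single NaN gradient value makes the whole norm NaN.
        raise AssertionError("found NaN in local grad norm")
    return norm
```

Checking locally narrows the search: if the assertion fires on one rank only, the bad value came from that rank's own backward pass rather than from the collective.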

Hi, thank you for your great work! Could you provide more details about the pretrain dataset? How has the pretrain dataset been optimized in DeepSeek-V2 compared to the previous version,...