ldwang issues

Results: 9 issues of ldwang

Failure after loading checkpoint shards.

What is your question? Using the metaseq-train script, when training finished, checkpoint shards are...

question

Using only the command mergekit-moe xxx.yml output, the following errors occurred. We just intend to merge the models into an MoE. Thanks for your advice. ` base_model: A gate_mode: hidden dtype: bfloat16 experts:...

mu parametrization for gated-mlp and group-query attention

Could you please give me advice about mu parametrization for gated-mlp and group-query attention? Thanks very much. @thegregyang @edwardjhu

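Model loading problem

https://github.com/OpenBMB/ModelCenter/blob/main/examples/cpm2/pretrain_cpm2.py#L24 Is the model initialization here executed on every GPU? If the model is very large, this could run out of memory (OOM). Thank you for your answer.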

[QUESTION] found NaN in local grad norm in backward pass before data-parallel communication collective

While continuing training of MoE models (loading an existing ckpt), at some steps an assertion error occurred as follows: "found NaN in local grad norm in backward pass before data-parallel communication collective". https://github.com/NVIDIA/Megatron-LM/blob/caf2007e080d65dd7488be7bd409b366e225ab5f/megatron/core/distributed/param_and_grad_buffer.py#L115 ##...

ldwang

Is it possible to run in cpu-ps mode?

Failure after loading checkpoint shards.

Just merge models

mu parametrization for gated-mlp and group-query attention

Model loading problem

[QUESTION] found NaN in local grad norm in backward pass before data-parallel communication collective

About datasets

Questions about evaluation like MMLU

Unexpected behavior when using sentence_dedup with split_sentences=True