
Ongoing research training transformer models at scale

Results: 294 Megatron-LM issues

**Is your feature request related to a problem? Please describe.** As far as I know, the current distributed optimizer in Megatron-LM implements ZeRO-1, but ZeRO-1 does not save enough GPU...
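For context, a minimal sketch of the per-rank memory arithmetic behind ZeRO-1 versus full replication, assuming fp16 parameters/gradients and fp32 Adam state; the model size and data-parallel degree below are made-up illustrative numbers, not anything from the issue:

```python
# Rough per-rank memory estimate (GB) for a hypothetical model.
# Assumes fp16 params/grads (2 bytes each) and fp32 Adam state
# (master params + momentum + variance = 12 bytes per parameter).
def per_rank_memory_gb(num_params: float, dp_size: int, zero1: bool) -> float:
    params = 2 * num_params          # fp16 parameters, replicated on every rank
    grads = 2 * num_params           # fp16 gradients, replicated on every rank
    optim = 12 * num_params          # fp32 Adam optimizer state
    if zero1:
        optim /= dp_size             # ZeRO-1 shards only the optimizer state
    return (params + grads + optim) / 1e9

if __name__ == "__main__":
    n, dp = 13e9, 8                  # hypothetical 13B-parameter model, DP=8
    print("replicated:", per_rank_memory_gb(n, dp, zero1=False), "GB")
    print("zero1     :", per_rank_memory_gb(n, dp, zero1=True), "GB")
```

This is why ZeRO-1 alone may not be enough: parameters and gradients stay replicated, so only the 12-bytes-per-parameter optimizer term shrinks with the data-parallel size.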


**Describe the bug** I am running [step 3](https://github.com/NVIDIA/Megatron-LM/blob/InstructRetro/tools/retro/build_db.md#step-3-build-index-for-similarity-search) on a single 80 GB A100 GPU to "Build index for similarity search". My "DATA_BLEND" is the first 10000 scraped text items from openwebtext...
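For readers unfamiliar with that step, it builds an approximate-nearest-neighbour index over retrieval embeddings. A standalone sketch of the idea using Faiss, with a flat (exact) index for simplicity; the dimension and data here are placeholders, not the index type or values the Megatron Retro tool uses:

```python
# Illustrative similarity-search index build with Faiss (exact flat index).
# Real large-scale builds typically use approximate IVF/PQ indexes instead.
import numpy as np
import faiss

d = 1024                                           # embedding dimension (assumed)
embeddings = np.random.rand(10000, d).astype("float32")

index = faiss.IndexFlatL2(d)                       # exact L2 index
index.add(embeddings)                              # add all vectors

distances, ids = index.search(embeddings[:5], 4)   # 4 nearest neighbours per query
print(ids)
```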


Hi, I was training the 345M GPT-2 model using your example script `examples/pretrain_gpt.sh`. The validation loss and PPL, however, keep going up, while the training loss decreases as expected. ![image](https://github.com/NVIDIA/Megatron-LM/assets/140472590/ab8fb941-d9d0-4def-b7cf-71659f5bf6af) My...

While continuing training of MoE models (loading an existing checkpoint), assertion errors occurred at some steps, as follows: "found NaN in local grad norm in backward pass before data-parallel communication collective". https://github.com/NVIDIA/Megatron-LM/blob/caf2007e080d65dd7488be7bd409b366e225ab5f/megatron/core/distributed/param_and_grad_buffer.py#L115 ##...
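For debugging, a standalone sketch of the kind of check that assertion corresponds to, i.e. verifying gradients are finite before the data-parallel gradient collective; this is an illustration, not the code in `param_and_grad_buffer.py`:

```python
# Standalone illustration: detect non-finite gradients before launching the
# data-parallel gradient all-reduce, mirroring the spirit of the assertion
# "found NaN in local grad norm in backward pass before data-parallel communication".
import torch

def assert_finite_grads(model: torch.nn.Module) -> None:
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            raise RuntimeError(
                f"found NaN/Inf in local grad of {name} before data-parallel communication"
            )

model = torch.nn.Linear(8, 8)
model(torch.randn(4, 8)).sum().backward()
assert_finite_grads(model)   # passes for a healthy backward pass
```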

**Your question** I run pretrain_gpt with the same architecture, data, training hyperparameters, and hardware, with and without using megatron_core when building the model. I notice a clearly **worse wall-clock time...

I am trying to convert a GPT checkpoint from **local** to **transformer_engine** according to the following map: `{'input_layernorm.': 'self_attention.linear_qkv.layer_norm_', 'pre_mlp_layernorm.': 'mlp.linear_fc1.layer_norm_'}`. It works well only when the optimizer is...
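A minimal sketch of the key renaming that map implies for the model state dict; the helper below is hypothetical and assumes a plain prefix substitution is all that is needed, which is not necessarily how the repo's converter works:

```python
# Hypothetical helper: rename "local" layer-norm keys to the
# transformer_engine fused names using the prefix map quoted above.
LOCAL_TO_TE = {
    "input_layernorm.": "self_attention.linear_qkv.layer_norm_",
    "pre_mlp_layernorm.": "mlp.linear_fc1.layer_norm_",
}

def rename_keys(state_dict: dict) -> dict:
    renamed = {}
    for key, value in state_dict.items():
        new_key = key
        for old, new in LOCAL_TO_TE.items():
            new_key = new_key.replace(old, new)
        renamed[new_key] = value
    return renamed
```

The open question in the issue is the optimizer state, whose parameter indexing does not follow these string keys, which is presumably why the mapping alone only works for some optimizer configurations.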


If we enable expert parallelism, there are two optimizers: one for dense parameters and one for expert parameters. When we call `optimizer.step()`, each optimizer computes the grad norm over its own parameters...
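For reference, combining separately computed grad norms into one global norm is just the root of the sum of squares; a standalone sketch under that assumption (the function below is illustrative, not Megatron-LM's chained-optimizer code):

```python
# Illustrative combination of per-optimizer grad norms into a single global
# norm, so dense and expert parameters can be clipped consistently.
import math

def combined_grad_norm(dense_norm: float, expert_norm: float) -> float:
    # The L2 norm over the union of parameter groups equals the root of the
    # sum of squared per-group L2 norms.
    return math.sqrt(dense_norm ** 2 + expert_norm ** 2)

print(combined_grad_norm(3.0, 4.0))  # -> 5.0
```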

This PR aims to pass the `timeout` parameter to the `new_group` function. Previously, the ProcessGroups created by `new_group` did not set the `timeout` parameter, which would make communications under these ProcessGroups use...
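For illustration, `torch.distributed.new_group` does accept a `timeout` argument; a minimal sketch of forwarding it (the wrapper name, group ranks, and timeout value are placeholders, not the PR's actual change):

```python
# Minimal sketch: forwarding a timeout when creating a process group, so
# collectives on that group use the configured timeout rather than the default.
from datetime import timedelta
import torch.distributed as dist

def make_group_with_timeout(ranks, timeout_minutes: int = 30):
    # torch.distributed.new_group accepts a `timeout` kwarg for supported backends.
    return dist.new_group(ranks=ranks, timeout=timedelta(minutes=timeout_minutes))
```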

Hi there, I noticed that distributed checkpointing was recently added to the repo under the megatron/core/dist_checkpointing directory. From the current implementation, I find it a good match for my use...
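For anyone else evaluating that module, its entry points are roughly `save`/`load` over a sharded state dict. A hedged sketch, assuming the `ShardedTensor.from_rank_offsets` helper and sharding along dim 0 across data-parallel ranks; argument names and helpers may differ between releases:

```python
# Hedged sketch of megatron/core/dist_checkpointing usage; based on my reading
# of the module and may not match the exact API in a given release.
import torch
from megatron.core import dist_checkpointing
from megatron.core.dist_checkpointing import ShardedTensor

rank, world_size = 0, 1          # placeholders; normally taken from torch.distributed
local_weight = torch.randn(16, 32)

# Describe how this rank's shard fits into the global tensor (sharded on dim 0).
sharded_sd = {
    "decoder.weight": ShardedTensor.from_rank_offsets(
        "decoder.weight", local_weight, (0, rank, world_size)
    ),
}

dist_checkpointing.save(sharded_sd, "/tmp/dist_ckpt")           # write the checkpoint
loaded = dist_checkpointing.load(sharded_sd, "/tmp/dist_ckpt")  # read it back
```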
