
Ongoing research training transformer models at scale

Results: 294 Megatron-LM issues

**Is your feature request related to a problem? Please describe.** As far as I know, the current distributed optimizer in Megatron-LM implements ZeRO-1, but ZeRO-1 does not save enough GPU...
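For context, a minimal sketch of the per-rank memory arithmetic behind ZeRO-1 versus full replication, assuming fp16 parameters/gradients and fp32 Adam state; the model size and data-parallel degree below are made-up illustrative numbers, not anything from the issue:

```python
# Rough per-rank memory estimate (GB) for a hypothetical model.
# Assumes fp16 params/grads (2 bytes each) and fp32 Adam state
# (master params + momentum + variance = 12 bytes per parameter).
def per_rank_memory_gb(num_params: float, dp_size: int, zero1: bool) -> float:
    params = 2 * num_params          # fp16 parameters, replicated on every rank
    grads = 2 * num_params           # fp16 gradients, replicated on every rank
    optim = 12 * num_params          # fp32 Adam optimizer state
    if zero1:
        optim /= dp_size             # ZeRO-1 shards only the optimizer state
    return (params + grads + optim) / 1e9

if __name__ == "__main__":
    n, dp = 13e9, 8                  # hypothetical 13B-parameter model, DP=8
    print("replicated:", per_rank_memory_gb(n, dp, zero1=False), "GB")
    print("zero1     :", per_rank_memory_gb(n, dp, zero1=True), "GB")
```

This is why ZeRO-1 alone may not be enough: parameters and gradients stay replicated, so only the 12-bytes-per-parameter optimizer term shrinks with the data-parallel size.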


**Describe the bug** I am running [step 3](https://github.com/NVIDIA/Megatron-LM/blob/InstructRetro/tools/retro/build_db.md#step-3-build-index-for-similarity-search) on a single 80 GB A100 GPU to "Build index for similarity search". My "DATA_BLEND" is the first 10000 scraped text items from openwebtext...
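For readers unfamiliar with that step, it builds an approximate-nearest-neighbour index over retrieval embeddings. A standalone sketch of the idea using Faiss, with a flat (exact) index for simplicity; the dimension and data here are placeholders, not the index type or values the Megatron Retro tool uses:

```python
# Illustrative similarity-search index build with Faiss (exact flat index).
# Real large-scale builds typically use approximate IVF/PQ indexes instead.
import numpy as np
import faiss

d = 1024                                           # embedding dimension (assumed)
embeddings = np.random.rand(10000, d).astype("float32")

index = faiss.IndexFlatL2(d)                       # exact L2 index
index.add(embeddings)                              # add all vectors

distances, ids = index.search(embeddings[:5], 4)   # 4 nearest neighbours per query
print(ids)
```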


Hi, I was training the 345M GPT-2 model using your example script `examples/pretrain_gpt.sh`. The validation loss and PPL, however, keep going up, while the training loss decreases as expected. ![image](https://github.com/NVIDIA/Megatron-LM/assets/140472590/ab8fb941-d9d0-4def-b7cf-71659f5bf6af) My...

While continuing training of MoE models (loading an existing checkpoint), assertion errors occurred at some steps, as follows: "found NaN in local grad norm in backward pass before data-parallel communication collective". https://github.com/NVIDIA/Megatron-LM/blob/caf2007e080d65dd7488be7bd409b366e225ab5f/megatron/core/distributed/param_and_grad_buffer.py#L115 ##...
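For debugging, a standalone sketch of the kind of check that assertion corresponds to, i.e. verifying gradients are finite before the data-parallel gradient collective; this is an illustration, not the code in `param_and_grad_buffer.py`:

```python
# Standalone illustration: detect non-finite gradients before launching the
# data-parallel gradient all-reduce, mirroring the spirit of the assertion
# "found NaN in local grad norm in backward pass before data-parallel communication".
import torch

def assert_finite_grads(model: torch.nn.Module) -> None:
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            raise RuntimeError(
                f"found NaN/Inf in local grad of {name} before data-parallel communication"
            )

model = torch.nn.Linear(8, 8)
model(torch.randn(4, 8)).sum().backward()
assert_finite_grads(model)   # passes for a healthy backward pass
```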

**Your question** I run pretrain_gpt with the same architecture, data, training hyperparameters, and hardware, with and without using megatron_core when building the model. I notice a clearly **worse wall-clock time...

I am trying to convert a GPT checkpoint from **local** to **transformer_engine** according to the following map: `{'input_layernorm.': 'self_attention.linear_qkv.layer_norm_', 'pre_mlp_layernorm.': 'mlp.linear_fc1.layer_norm_'}`. It works well only when the optimizer is...
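A minimal sketch of the key renaming that map implies for the model state dict; the helper below is hypothetical and assumes a plain prefix substitution is all that is needed, which is not necessarily how the repo's converter works:

```python
# Hypothetical helper: rename "local" layer-norm keys to the
# transformer_engine fused names using the prefix map quoted above.
LOCAL_TO_TE = {
    "input_layernorm.": "self_attention.linear_qkv.layer_norm_",
    "pre_mlp_layernorm.": "mlp.linear_fc1.layer_norm_",
}

def rename_keys(state_dict: dict) -> dict:
    renamed = {}
    for key, value in state_dict.items():
        new_key = key
        for old, new in LOCAL_TO_TE.items():
            new_key = new_key.replace(old, new)
        renamed[new_key] = value
    return renamed
```

The open question in the issue is the optimizer state, whose parameter indexing does not follow these string keys, which is presumably why the mapping alone only works for some optimizer configurations.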


If we enable expert parallelism, there are two optimizers: one for dense parameters and one for expert parameters. When we call `optimizer.step()`, each optimizer computes the grad norm over its own parameters...
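For reference, combining separately computed grad norms into one global norm is just the root of the sum of squares; a standalone sketch under that assumption (the function below is illustrative, not Megatron-LM's chained-optimizer code):

```python
# Illustrative combination of per-optimizer grad norms into a single global
# norm, so dense and expert parameters can be clipped consistently.
import math

def combined_grad_norm(dense_norm: float, expert_norm: float) -> float:
    # The L2 norm over the union of parameter groups equals the root of the
    # sum of squared per-group L2 norms.
    return math.sqrt(dense_norm ** 2 + expert_norm ** 2)

print(combined_grad_norm(3.0, 4.0))  # -> 5.0
```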

This PR aims to pass the `timeout` parameter to the `new_group` function. Previously, the ProcessGroups created by `new_group` did not set the `timeout` parameter, which would make communications under these ProcessGroups use...
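For illustration, `torch.distributed.new_group` does accept a `timeout` argument; a minimal sketch of forwarding it (the wrapper name, group ranks, and timeout value are placeholders, not the PR's actual change):

```python
# Minimal sketch: forwarding a timeout when creating a process group, so
# collectives on that group use the configured timeout rather than the default.
from datetime import timedelta
import torch.distributed as dist

def make_group_with_timeout(ranks, timeout_minutes: int = 30):
    # torch.distributed.new_group accepts a `timeout` kwarg for supported backends.
    return dist.new_group(ranks=ranks, timeout=timedelta(minutes=timeout_minutes))
```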

Hi there, I noticed that distributed checkpointing was recently added to the repo under the megatron/core/dist_checkpointing directory. From the current implementation, I find it a good match for my use...
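For anyone else evaluating that module, its entry points are roughly `save`/`load` over a sharded state dict. A hedged sketch, assuming the `ShardedTensor.from_rank_offsets` helper and sharding along dim 0 across data-parallel ranks; argument names and helpers may differ between releases:

```python
# Hedged sketch of megatron/core/dist_checkpointing usage; based on my reading
# of the module and may not match the exact API in a given release.
import torch
from megatron.core import dist_checkpointing
from megatron.core.dist_checkpointing import ShardedTensor

rank, world_size = 0, 1          # placeholders; normally taken from torch.distributed
local_weight = torch.randn(16, 32)

# Describe how this rank's shard fits into the global tensor (sharded on dim 0).
sharded_sd = {
    "decoder.weight": ShardedTensor.from_rank_offsets(
        "decoder.weight", local_weight, (0, rank, world_size)
    ),
}

dist_checkpointing.save(sharded_sd, "/tmp/dist_ckpt")           # write the checkpoint
loaded = dist_checkpointing.load(sharded_sd, "/tmp/dist_ckpt")  # read it back
```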
