Megatron-LM
Ongoing research training transformer models at scale
**Is your feature request related to a problem? Please describe.** As far as I know, the current distributed optimizer in Megatron-LM implements ZeRO-1, but ZeRO-1 alone does not save enough GPU...
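For context, a rough per-GPU memory accounting for mixed-precision Adam (a back-of-envelope sketch, not Megatron's own bookkeeping; the 2/2/12-bytes-per-parameter split follows the usual ZeRO-paper accounting) shows why sharding only the optimizer state may not be enough:

```python
# Back-of-envelope per-GPU memory (bytes) for mixed-precision Adam.
# Under ZeRO-1 only the optimizer-state term is sharded across the
# data-parallel group; fp16 params and grads stay replicated.
def per_gpu_memory_bytes(num_params: int, dp_size: int) -> dict:
    return {
        'params_fp16': 2 * num_params,                        # replicated
        'grads_fp16': 2 * num_params,                         # replicated (ZeRO-2 would shard this)
        'optimizer_state_fp32': 12 * num_params // dp_size,   # sharded by ZeRO-1
    }

# Example: a 7B-parameter model with data-parallel size 8.
print(per_gpu_memory_bytes(7_000_000_000, 8))
```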
**Describe the bug** I am running [step 3](https://github.com/NVIDIA/Megatron-LM/blob/InstructRetro/tools/retro/build_db.md#step-3-build-index-for-similarity-search) on one 80G A100 GPU to "Build index for similarity search". My "DATA_BLEND" is the first 10000 scraped text items from openwebtext...
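(As an aside, the index-building step boils down to training and populating an approximate-nearest-neighbor index over the chunk embeddings; the snippet below is a generic Faiss sketch with placeholder dimensions and index string, not Megatron's Retro tooling.)

```python
# Generic Faiss sketch of "build an index for similarity search".
# The 1024-dim random embeddings and the 'IVF64,Flat' factory string are
# placeholders standing in for the real chunk embeddings and index config.
import numpy as np
import faiss

embeddings = np.random.rand(10000, 1024).astype('float32')
index = faiss.index_factory(1024, 'IVF64,Flat')
index.train(embeddings)   # learn the IVF clustering
index.add(embeddings)     # populate the index
distances, ids = index.search(embeddings[:5], k=4)
```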
Hi, so I was training the 345M GPT-2 model using your example script `examples/pretrain_gpt.sh`. The validation loss and PPL, however, keep going up, while the training loss decreases as expected. My...
While continuing training of MoE models (loading an existing checkpoint), assertion errors occur at some steps: "found NaN in local grad norm in backward pass before data-parallel communication collective". https://github.com/NVIDIA/Megatron-LM/blob/caf2007e080d65dd7488be7bd409b366e225ab5f/megatron/core/distributed/param_and_grad_buffer.py#L115 ##...
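A minimal sketch of the kind of check that fires here (variable and function names are illustrative, not the actual code in param_and_grad_buffer.py):

```python
import torch

def check_local_grad_norm(grad_buffer: torch.Tensor) -> torch.Tensor:
    # Compute the L2 norm of this rank's local gradients and make sure it is
    # not NaN before the data-parallel communication collective runs.
    local_norm = grad_buffer.float().norm(2)
    assert not torch.isnan(local_norm), (
        'found NaN in local grad norm in backward pass before '
        'data-parallel communication collective'
    )
    return local_norm
```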
**Your question** I run pretrain_gpt on the same architecture, data, training hyperparameters, and hardware, with and without using megatron_core when building the model. I notice a clearly **worse wall clock time...
I am trying to convert a GPT checkpoint from **local** to **transformer_engine** according to the following map: `{'input_layernorm.': 'self_attention.linear_qkv.layer_norm_', 'pre_mlp_layernorm.': 'mlp.linear_fc1.layer_norm_'}`. It works well only when the optimizer is...
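A minimal sketch of the key renaming this map implies (the checkpoint file names and the top-level 'model' key are assumptions, not part of any official converter):

```python
import torch

RENAME_MAP = {
    'input_layernorm.': 'self_attention.linear_qkv.layer_norm_',
    'pre_mlp_layernorm.': 'mlp.linear_fc1.layer_norm_',
}

def rename_keys(state_dict: dict) -> dict:
    # Apply the prefix substitutions to every parameter name.
    renamed = {}
    for key, tensor in state_dict.items():
        for old, new in RENAME_MAP.items():
            key = key.replace(old, new)
        renamed[key] = tensor
    return renamed

ckpt = torch.load('local_ckpt.pt', map_location='cpu')   # hypothetical input path
ckpt['model'] = rename_keys(ckpt['model'])
torch.save(ckpt, 'transformer_engine_ckpt.pt')           # hypothetical output path
```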
If we enable expert parallelism, there are two optimizers: one for dense parameters and one for expert parameters. When we call `optimizer.step()`, each optimizer computes the grad norm over its own parameters....
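For reference, combining the two per-optimizer L2 norms into a single global norm is just the root of the summed squares; the sketch below uses illustrative names, not Megatron's optimizer API:

```python
import math

def combined_grad_norm(dense_norm: float, expert_norm: float) -> float:
    # ||g||_2 over the union of dense and expert parameters.
    return math.sqrt(dense_norm ** 2 + expert_norm ** 2)

# Example: per-group norms of 3.0 and 4.0 give a combined norm of 5.0.
assert combined_grad_norm(3.0, 4.0) == 5.0
```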
This PR passes the `timeout` parameter to the `new_group` function. Previously, the ProcessGroups created by `new_group` did not set a `timeout`, which made communications under these ProcessGroups use...
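An illustrative sketch of what forwarding the timeout looks like (the 30-minute value and the helper name are examples, not the PR's actual change):

```python
from datetime import timedelta
import torch.distributed as dist

def new_group_with_timeout(ranks, timeout_minutes: int = 30):
    # Forward an explicit timeout so collectives on this group do not fall
    # back to the backend's default timeout.
    return dist.new_group(ranks=ranks, timeout=timedelta(minutes=timeout_minutes))
```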
Hi there, I noticed that distributed checkpointing was recently added to the repo under the megatron/core/dist_checkpointing directory. From the current implementation, it looks like a good match for my use...
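A hedged sketch of how the sharded save flow appears to fit together, based on the module layout; the exact signatures of `ShardedTensor.from_rank_offsets` and `dist_checkpointing.save` are assumptions to be checked against megatron/core/dist_checkpointing:

```python
import torch
from megatron.core import dist_checkpointing
from megatron.core.dist_checkpointing import ShardedTensor

def save_sharded_weight(local_shard: torch.Tensor, rank: int, world_size: int, ckpt_dir: str):
    # Each rank describes its slice of a globally named tensor (here sharded
    # along dim 0 into world_size fragments), then all ranks save collectively.
    sharded = ShardedTensor.from_rank_offsets(
        'model.weight', local_shard, (0, rank, world_size)
    )
    dist_checkpointing.save({'model.weight': sharded}, ckpt_dir)
```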