Megatron-LM

Ongoing research training transformer models at scale

Results: 294 Megatron-LM issues, sorted by recently updated

**Your question** Ask a clear and concise question about Megatron-LM. Can we have sample idx + bin files as required by pretrain_gpt.py? Running tools/preprocess_data.py on some sample...
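
For reference, a minimal sketch of the loose-JSON input that `tools/preprocess_data.py` consumes: one JSON object per line, with the document text under the key selected by `--json-keys` ("text" by default). The sample strings and the output filename here are made up.

```python
# A small, made-up input file for tools/preprocess_data.py: one JSON object per
# line, with the document text under the "text" key (the default --json-keys).
import json

samples = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "Megatron-LM is an ongoing research project for training transformer models at scale."},
]

with open("sample.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# preprocess_data.py would then turn this file into the .bin/.idx pair that
# pretrain_gpt.py expects as its data prefix.
```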

This minor patch fixes a typo in the MoE introduction document.

**Describe the bug** Hi, I think there is a bug when context parallelism is on, and we can discuss it. https://github.com/NVIDIA/Megatron-LM/blob/0bc3547702464501feefeb5523b7a17e591b21fa/pretrain_gpt.py#L148 From this [issue](https://github.com/NVIDIA/Megatron-LM/issues/673), I know the result is the same for...
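
For background, a minimal sketch of averaging a loss under context parallelism; `cp_group` is a hypothetical handle for the context-parallel process group, not Megatron-LM's actual API. Each rank sees only its slice of the tokens, so the per-rank loss sum and token count are summed over the group before dividing.

```python
# A minimal sketch (not Megatron-LM's actual loss_func): with context
# parallelism, each rank computes the loss only over its slice of the tokens,
# so the per-rank sums must be reduced over the CP group before averaging.
import torch
import torch.distributed as dist


def cp_averaged_loss(local_loss_sum: torch.Tensor,
                     local_num_tokens: torch.Tensor,
                     cp_group: dist.ProcessGroup) -> torch.Tensor:
    # Both inputs are float scalar tensors on the current device; cp_group is
    # the (hypothetical) context-parallel process group.
    totals = torch.stack([local_loss_sum, local_num_tokens])
    dist.all_reduce(totals, op=dist.ReduceOp.SUM, group=cp_group)
    return totals[0] / totals[1]
```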

This function is wrong; the program will hang because of the `group` variable: `def _batched_p2p_ops(*, tensor_send_prev: Optional[torch.Tensor], tensor_recv_prev: Optional[torch.Tensor], tensor_send_next: Optional[torch.Tensor], tensor_recv_next: Optional[torch.Tensor], group: torch.distributed.ProcessGroup)`. After the modification: def...
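
As an illustration (not the actual Megatron-LM patch), a minimal sketch of a batched P2P helper that attaches `group` to every `torch.distributed.P2POp`; the function name and the `prev_rank`/`next_rank` parameters are hypothetical.

```python
# A minimal sketch of a batched P2P helper that passes `group` to each P2POp,
# which is one way to avoid a hang when a non-default process group is used.
from typing import Optional

import torch
import torch.distributed as dist


def batched_p2p_ops_sketch(
    *,
    tensor_send_prev: Optional[torch.Tensor],
    tensor_recv_prev: Optional[torch.Tensor],
    tensor_send_next: Optional[torch.Tensor],
    tensor_recv_next: Optional[torch.Tensor],
    prev_rank: int,
    next_rank: int,
    group: dist.ProcessGroup,
):
    # Build the send/recv ops, each bound to the given process group.
    ops = []
    if tensor_send_prev is not None:
        ops.append(dist.P2POp(dist.isend, tensor_send_prev, prev_rank, group))
    if tensor_recv_prev is not None:
        ops.append(dist.P2POp(dist.irecv, tensor_recv_prev, prev_rank, group))
    if tensor_send_next is not None:
        ops.append(dist.P2POp(dist.isend, tensor_send_next, next_rank, group))
    if tensor_recv_next is not None:
        ops.append(dist.P2POp(dist.irecv, tensor_recv_next, next_rank, group))
    # batch_isend_irecv launches all ops together and returns async work handles.
    reqs = dist.batch_isend_irecv(ops) if ops else []
    return reqs
```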

stale

# Batch_input and elapsed time per iteration slow down during model training

![微信图片编辑_20240629150957](https://github.com/EleutherAI/gpt-neox/assets/140717408/dae875c7-c01f-47e0-8767-aa8fe53cd476)

## Arguments

- data_impl: mmap (updated)
- deepspeed_extra_args: {'bf16': {'enabled': True}} (updated)
- dynamic_loss_scale: True (updated)
- eval_interval: 40000 (updated)
- eval_iters...

stale

I asked this question in the discussion section but did not receive any response, so I am asking here with a bit more detail. I am trying to figure out if...

- The most important thing is adding `// args.context_parallel_size`; otherwise, when we scale very long sequences across many GPUs using context parallelism, it causes OOM here (see the sketch after this list).
- Also add...
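
A hedged, made-up numerical illustration of why the division matters: under context parallelism each rank holds only its slice of the sequence, so sizing buffers from the full sequence length over-allocates by a factor of `context_parallel_size`.

```python
# Hypothetical numbers, purely for illustration.
seq_length = 131072               # global sequence length
context_parallel_size = 8         # context-parallel degree
local_seq_length = seq_length // context_parallel_size
print(local_seq_length)           # 16384 tokens actually held per rank
```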

You may want to enter multiple context managers in one with-statement. However, `with rng_context and fp8_context` is equivalent to `with fp8_context`, because the `and` expression simply evaluates to its second operand when the first is truthy (the `and` keyword cannot be overridden; `__and__` only applies to the `&` operator). The correct way...
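
A minimal, self-contained sketch of the usual alternatives, using `contextlib.nullcontext()` stand-ins rather than Megatron-LM's real `rng_context` and `fp8_context` objects:

```python
import contextlib

# Stand-ins: in the real code these would be the already-constructed
# rng_context and fp8_context context-manager objects.
rng_context = contextlib.nullcontext()
fp8_context = contextlib.nullcontext()

# Incorrect: `rng_context and fp8_context` evaluates to fp8_context alone,
# so only fp8_context.__enter__ is ever called.
with rng_context and fp8_context:
    pass

# Correct: separate the managers with a comma so both are entered and exited.
with rng_context, fp8_context:
    pass

# Equivalent alternative when the set of managers is built dynamically.
with contextlib.ExitStack() as stack:
    stack.enter_context(rng_context)
    stack.enter_context(fp8_context)
```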

stale

Hello, after pre-processing the dataset with a BPE tokenizer, when I launch the `train.sh` script for Mamba I get this error. In the script it is mentioned that I have...