Megatron-LM
Ongoing research training transformer models at scale
**Your question** Can we have sample idx + bin files as required by `pretrain_gpt.py`? Running tools/preprocess_data.py on some sample...
Add Hopper llama2 7b mcore gold example
This minor patch fixes a typo in the MoE introduction document.
**Describe the bug** Hi, I think there is a bug when context parallelism is on, and I would like to discuss it. https://github.com/NVIDIA/Megatron-LM/blob/0bc3547702464501feefeb5523b7a17e591b21fa/pretrain_gpt.py#L148 From this [issue](https://github.com/NVIDIA/Megatron-LM/issues/673), I know the result is the same for...
This function is wrong; the program will hang because of the "group" variable:

```python
def _batched_p2p_ops(
    *,
    tensor_send_prev: Optional[torch.Tensor],
    tensor_recv_prev: Optional[torch.Tensor],
    tensor_send_next: Optional[torch.Tensor],
    tensor_recv_next: Optional[torch.Tensor],
    group: torch.distributed.ProcessGroup
)
```

After modification: def...
# Batch_input and elapsed time per iteration slow down during model training

## Arguments

```
data_impl ....................... mmap........................updated
deepspeed_extra_args ............ {'bf16': {'enabled': True}}.updated
dynamic_loss_scale .............. True........................updated
eval_interval ................... 40000.......................updated
eval_iters...
```
I asked this question in the discussion section but did not receive any response, so I am asking here with a bit more detail. I am trying to figure out if...
- The most important thing is adding `// args.context_parallel_size`; otherwise, when we scale very long sequences across many GPUs using context parallelism, it causes an OOM here.
- Also add...
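The point of the division above is that under context parallelism each rank holds only its shard of the sequence, so any buffer sized by the full sequence length must be divided by the context-parallel size. A minimal sketch, assuming a hypothetical helper `local_seq_length` (not Megatron-LM code):

```python
def local_seq_length(seq_length: int, context_parallel_size: int) -> int:
    # Each context-parallel rank owns an equal contiguous shard of the
    # sequence, so the global length must divide evenly across ranks.
    assert seq_length % context_parallel_size == 0, (
        "seq_length must be divisible by context_parallel_size"
    )
    return seq_length // context_parallel_size

# Without the division, each of the 8 GPUs would allocate buffers for all
# 131072 tokens instead of its own 16384-token shard.
print(local_seq_length(131072, 8))  # 16384
```

This is why omitting `// args.context_parallel_size` negates the memory benefit of context parallelism: every rank sizes its activations for the whole sequence.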
You may want to enter multiple context managers in one with-statement. However, `with rng_context and fp8_context` is equivalent to `with fp8_context`: the `and` keyword simply evaluates to one of its operands (here `fp8_context`, since `rng_context` is truthy) and cannot be overloaded, so `rng_context` is never entered. The correct way...
Hello, after pre-processing the dataset with a BPE tokenizer, when I launch the 'train.sh' script for Mamba I get this error. The script mentions that I have...