Megatron-LM
Ongoing research training transformer models at scale
# Issue: Pretraining jobs often span weeks or months, and it is frequently desirable to stop training, modify the dataset by adding/removing datashards or reweighting them, and resume training....
When I use Megatron.core to train a MoE model, I get the following error: **Output info:** [rank2]:[E ProcessGroupNCCL.cpp:754] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=305, OpType=ALLREDUCE, NumelIn=1,...
**Your question** Why not use PyTorch's tensor-parallel APIs? https://pytorch.org/docs/stable/distributed.tensor.parallel.html
Currently, `model/classification.py` throws an error because line 49 attempts to pass a nonexistent attribute (`init_method`) to `get_linear_layer`. This commit updates `classification.py` to instead pass `init_method_normal` to `get_linear_layer`, which...
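The fix above comes down to passing a real initializer *function* to the layer constructor rather than an attribute that does not exist. A minimal sketch of that closure pattern, with toy stand-ins for `init_method_normal` and `get_linear_layer` (the real Megatron versions operate on `torch.Tensor`s; plain lists are used here only for illustration):

```python
import random

def init_method_normal(sigma):
    """Return an initializer drawing from N(0, sigma), mirroring the
    closure pattern of Megatron's init_method_normal."""
    def init_(n):
        return [random.gauss(0.0, sigma) for _ in range(n)]
    return init_

def get_linear_layer(rows, cols, init_method):
    """Toy stand-in: build a weight matrix using the supplied initializer."""
    return [init_method(cols) for _ in range(rows)]

# The fix: pass an initializer function, not a nonexistent attribute.
weight = get_linear_layer(4, 8, init_method_normal(0.02))
```

The key point is that `get_linear_layer` expects a callable; `init_method_normal(sigma)` returns one, whereas `init_method` was never defined.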
Hi, I suggest we modify the [FLOPs calculation in the MFU](https://github.com/NVIDIA/Megatron-LM/blob/c3677e09aa4e2eec37048307bd795928b8f8324a/megatron/training/training.py#L88-L95) according to the [FlashAttention benchmark script](https://github.com/Dao-AILab/flash-attention/blob/9c0e9ee86d0e0022b60deddb405c20ab77481582/benchmarks/benchmark_flash_attention.py#L27-L30). Specifically, the current calculation for the causal mask **can exceed 100% MFU** for...
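For context, the FlashAttention benchmark convention halves the attention FLOPs under a causal mask, since only the lower triangle of the score matrix is computed; counting the full `4 * b * s^2 * h * d` for causal attention is what lets reported MFU exceed 100%. A sketch of that convention (parameter names are mine, not Megatron's):

```python
def attention_flops(batch, seqlen, nheads, head_dim, causal=False, mode="fwd"):
    """Attention matmul FLOPs, following the FlashAttention benchmark
    convention: 4 * b * s^2 * h * d for the two matmuls (QK^T and PV),
    halved when a causal mask skips the upper triangle."""
    f = 4 * batch * seqlen ** 2 * nheads * head_dim
    if causal:
        f //= 2
    # Backward is conventionally counted as 2.5x forward (recomputation included).
    return {"fwd": f, "bwd": 2.5 * f, "fwd_bwd": 3.5 * f}[mode]
```

With this accounting, a causal-mask run has half the forward attention FLOPs of a full-mask run at the same shape, so MFU computed against it stays below the hardware peak.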
I encountered a problem when using the Megatron pipeline. The function I am using is forward_backward_pipelining_without_interleaving. In this pipeline function, each pipeline stage calls forward_step for the forward pass: output_tensor...
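To make the schedule in question concrete, here is a minimal sketch of the non-interleaved 1F1B ordering that `forward_backward_pipelining_without_interleaving` follows: each stage runs some warmup forward steps, alternates one-forward-one-backward in the steady state, then drains the remaining backwards. The communication, tensor shapes, and `forward_step` signature of the real function are omitted; this only illustrates the step ordering:

```python
def pipeline_1f1b_schedule(rank, num_stages, num_microbatches,
                           forward_step, backward_step):
    """Sketch of the non-interleaved 1F1B schedule: warmup forwards,
    steady-state forward/backward pairs, then cooldown backwards."""
    warmup = min(num_stages - rank - 1, num_microbatches)
    remaining = num_microbatches - warmup
    steps = []
    for _ in range(warmup):                # warmup: forwards only
        steps.append(forward_step())
    for _ in range(remaining):             # steady state: 1F1B pairs
        steps.append(forward_step())
        steps.append(backward_step())
    for _ in range(warmup):                # cooldown: drain backwards
        steps.append(backward_step())
    return steps
```

Later stages have fewer warmup steps (the last stage has none), so every stage ends up executing exactly `num_microbatches` forward and `num_microbatches` backward steps.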
**Your question** How can I profile bubble time and p2p comm time in pipeline parallelism?
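Measuring actual bubble and p2p time requires a trace (e.g. `torch.profiler` with NCCL activity), but the *theoretical* bubble fraction of the non-interleaved schedule is easy to compute and useful as a baseline to compare a profile against. A small helper for that standard formula, `(p - 1) / (m + p - 1)`:

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Theoretical bubble fraction of the non-interleaved 1F1B pipeline
    schedule: (p - 1) / (m + p - 1), where p = pipeline stages and
    m = microbatches per global batch."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)
```

For example, 4 pipeline stages with 12 microbatches gives a 20% bubble; increasing microbatches (or interleaving virtual stages) shrinks it.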
**Describe the bug** Currently, when `label_smoothing` is enabled, `mean_log_probs` is computed as a local mean ([code pointer](https://github.com/NVIDIA/Megatron-LM/blob/a5415fcfacef2a37416259bd38b7c4b673583675/megatron/core/tensor_parallel/cross_entropy.py#L87)). This is not the expected behavior for label smoothing, and can cause the...
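The issue with a local mean can be shown with plain numbers: when the vocab is sharded across tensor-parallel ranks, averaging per-shard means is not the same as the mean over the full vocab unless every shard holds the same number of entries. A toy illustration (shard sizes chosen unequal to make the discrepancy visible; the real code operates on vocab-parallel logit tensors):

```python
# Two "tensor-parallel ranks" each hold a shard of the vocab log-probs.
shard0 = [-1.0, -2.0, -3.0]   # 3 vocab entries on rank 0
shard1 = [-4.0, -5.0]         # 2 vocab entries on rank 1

# Buggy pattern: each rank takes a local mean, then the means are averaged.
local_means = [sum(shard0) / len(shard0), sum(shard1) / len(shard1)]
buggy_mean = sum(local_means) / len(local_means)

# Correct pattern: reduce the local *sums*, then divide by the full vocab size
# (in Megatron this reduction would be an all-reduce over the TP group).
global_mean = (sum(shard0) + sum(shard1)) / (len(shard0) + len(shard1))
```

Here the buggy path gives -3.25 while the true mean is -3.0, which is why the label-smoothing term computed from a local mean comes out wrong.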
Move the assertion that checks whether num_layers is divisible by the pipeline size: currently no assertion runs when `args.num_layers_per_virtual_pipeline_stage` is None, even though num_layers should still be divisible. I...
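A sketch of the suggested check, so the divisibility requirement is enforced whether or not virtual pipeline stages are configured (function and argument names here mirror the Megatron args but the helper itself is hypothetical):

```python
def check_layer_split(num_layers, pipeline_model_parallel_size,
                      num_layers_per_virtual_pipeline_stage=None):
    """Assert num_layers divides evenly across pipeline stages, even when
    num_layers_per_virtual_pipeline_stage is None."""
    assert num_layers % pipeline_model_parallel_size == 0, (
        f"num_layers ({num_layers}) must be divisible by "
        f"pipeline_model_parallel_size ({pipeline_model_parallel_size})"
    )
    if num_layers_per_virtual_pipeline_stage is not None:
        layers_per_stage = num_layers // pipeline_model_parallel_size
        assert layers_per_stage % num_layers_per_virtual_pipeline_stage == 0, (
            f"layers per pipeline stage ({layers_per_stage}) must be divisible "
            f"by num_layers_per_virtual_pipeline_stage "
            f"({num_layers_per_virtual_pipeline_stage})"
        )
```

With this ordering, `check_layer_split(25, 4)` fails immediately instead of silently producing an uneven layer split when the virtual-stage argument is unset.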
**Issue Title** Megatron-LM: Zero-1 with Distributed Optimizer Showing No Overlap in Communication and Computation **Issue Description** We are experiencing an issue with Megatron-LM where enabling zero-1 (`--overlap-grad-reduce --overlap-param-gather`) along with...