Megatron-LM
Ongoing research training transformer models at scale
# Issue: Pretraining jobs often span weeks or months, and it is frequently desirable to stop training, modify the dataset by adding/removing datashards or reweighting them, and resume training....
When I use Megatron.core to train a MoE model, I get the following error: **Output info:** [rank2]:[E ProcessGroupNCCL.cpp:754] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=305, OpType=ALLREDUCE, NumelIn=1,...
**Your question** Why not use PyTorch's tensor-parallel APIs? https://pytorch.org/docs/stable/distributed.tensor.parallel.html
Currently, `model/classification.py` throws an error because line 49 attempts to pass a nonexistent attribute (`init_method`) to `get_linear_layer`. This commit updates `classification.py` to instead pass `init_method_normal` to `get_linear_layer`, which...
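The fix above comes down to passing a real initializer *function* to the layer constructor rather than an attribute that does not exist. A minimal sketch of that closure pattern, with toy stand-ins for `init_method_normal` and `get_linear_layer` (the real Megatron versions operate on `torch.Tensor`s; plain lists are used here only for illustration):

```python
import random

def init_method_normal(sigma):
    """Return an initializer drawing from N(0, sigma), mirroring the
    closure pattern of Megatron's init_method_normal."""
    def init_(n):
        return [random.gauss(0.0, sigma) for _ in range(n)]
    return init_

def get_linear_layer(rows, cols, init_method):
    """Toy stand-in: build a weight matrix using the supplied initializer."""
    return [init_method(cols) for _ in range(rows)]

# The fix: pass an initializer function, not a nonexistent attribute.
weight = get_linear_layer(4, 8, init_method_normal(0.02))
```

The key point is that `get_linear_layer` expects a callable; `init_method_normal(sigma)` returns one, whereas `init_method` was never defined.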
Hi, I suggest we modify the [FLOPs calculation in the MFU](https://github.com/NVIDIA/Megatron-LM/blob/c3677e09aa4e2eec37048307bd795928b8f8324a/megatron/training/training.py#L88-L95) according to the [FlashAttention benchmark script](https://github.com/Dao-AILab/flash-attention/blob/9c0e9ee86d0e0022b60deddb405c20ab77481582/benchmarks/benchmark_flash_attention.py#L27-L30). Specifically, the current calculation for the causal mask **can exceed 100% MFU** for...
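For context, the FlashAttention benchmark convention halves the attention FLOPs under a causal mask, since only the lower triangle of the score matrix is computed; counting the full `4 * b * s^2 * h * d` for causal attention is what lets reported MFU exceed 100%. A sketch of that convention (parameter names are mine, not Megatron's):

```python
def attention_flops(batch, seqlen, nheads, head_dim, causal=False, mode="fwd"):
    """Attention matmul FLOPs, following the FlashAttention benchmark
    convention: 4 * b * s^2 * h * d for the two matmuls (QK^T and PV),
    halved when a causal mask skips the upper triangle."""
    f = 4 * batch * seqlen ** 2 * nheads * head_dim
    if causal:
        f //= 2
    # Backward is conventionally counted as 2.5x forward (recomputation included).
    return {"fwd": f, "bwd": 2.5 * f, "fwd_bwd": 3.5 * f}[mode]
```

With this accounting, a causal-mask run has half the forward attention FLOPs of a full-mask run at the same shape, so MFU computed against it stays below the hardware peak.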
I encountered a problem when using the Megatron pipeline. The function I am using is forward_backward_pipelining_without_interleaving. In this pipeline function, each pipeline stage calls forward_step for the forward pass: output_tensor...
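To make the schedule in question concrete, here is a minimal sketch of the non-interleaved 1F1B ordering that `forward_backward_pipelining_without_interleaving` follows: each stage runs some warmup forward steps, alternates one-forward-one-backward in the steady state, then drains the remaining backwards. The communication, tensor shapes, and `forward_step` signature of the real function are omitted; this only illustrates the step ordering:

```python
def pipeline_1f1b_schedule(rank, num_stages, num_microbatches,
                           forward_step, backward_step):
    """Sketch of the non-interleaved 1F1B schedule: warmup forwards,
    steady-state forward/backward pairs, then cooldown backwards."""
    warmup = min(num_stages - rank - 1, num_microbatches)
    remaining = num_microbatches - warmup
    steps = []
    for _ in range(warmup):                # warmup: forwards only
        steps.append(forward_step())
    for _ in range(remaining):             # steady state: 1F1B pairs
        steps.append(forward_step())
        steps.append(backward_step())
    for _ in range(warmup):                # cooldown: drain backwards
        steps.append(backward_step())
    return steps
```

Later stages have fewer warmup steps (the last stage has none), so every stage ends up executing exactly `num_microbatches` forward and `num_microbatches` backward steps.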
**Your question** How can I profile bubble time and p2p comm time in pipeline parallelism?
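Measuring actual bubble and p2p time requires a trace (e.g. `torch.profiler` with NCCL activity), but the *theoretical* bubble fraction of the non-interleaved schedule is easy to compute and useful as a baseline to compare a profile against. A small helper for that standard formula, `(p - 1) / (m + p - 1)`:

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Theoretical bubble fraction of the non-interleaved 1F1B pipeline
    schedule: (p - 1) / (m + p - 1), where p = pipeline stages and
    m = microbatches per global batch."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)
```

For example, 4 pipeline stages with 12 microbatches gives a 20% bubble; increasing microbatches (or interleaving virtual stages) shrinks it.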
**Describe the bug** Currently, when `label_smoothing` is enabled, `mean_log_probs` is computed as a local mean ([code pointer](https://github.com/NVIDIA/Megatron-LM/blob/a5415fcfacef2a37416259bd38b7c4b673583675/megatron/core/tensor_parallel/cross_entropy.py#L87)). This is not the expected behavior for label smoothing, and can cause the...
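The issue with a local mean can be shown with plain numbers: when the vocab is sharded across tensor-parallel ranks, averaging per-shard means is not the same as the mean over the full vocab unless every shard holds the same number of entries. A toy illustration (shard sizes chosen unequal to make the discrepancy visible; the real code operates on vocab-parallel logit tensors):

```python
# Two "tensor-parallel ranks" each hold a shard of the vocab log-probs.
shard0 = [-1.0, -2.0, -3.0]   # 3 vocab entries on rank 0
shard1 = [-4.0, -5.0]         # 2 vocab entries on rank 1

# Buggy pattern: each rank takes a local mean, then the means are averaged.
local_means = [sum(shard0) / len(shard0), sum(shard1) / len(shard1)]
buggy_mean = sum(local_means) / len(local_means)

# Correct pattern: reduce the local *sums*, then divide by the full vocab size
# (in Megatron this reduction would be an all-reduce over the TP group).
global_mean = (sum(shard0) + sum(shard1)) / (len(shard0) + len(shard1))
```

Here the buggy path gives -3.25 while the true mean is -3.0, which is why the label-smoothing term computed from a local mean comes out wrong.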
Move the assertion that checks whether num_layers is divisible by the pipeline size: currently no assertion runs when `args.num_layers_per_virtual_pipeline_stage` is None, even though num_layers should still be divisible. I...
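A sketch of the suggested check, so the divisibility requirement is enforced whether or not virtual pipeline stages are configured (function and argument names here mirror the Megatron args but the helper itself is hypothetical):

```python
def check_layer_split(num_layers, pipeline_model_parallel_size,
                      num_layers_per_virtual_pipeline_stage=None):
    """Assert num_layers divides evenly across pipeline stages, even when
    num_layers_per_virtual_pipeline_stage is None."""
    assert num_layers % pipeline_model_parallel_size == 0, (
        f"num_layers ({num_layers}) must be divisible by "
        f"pipeline_model_parallel_size ({pipeline_model_parallel_size})"
    )
    if num_layers_per_virtual_pipeline_stage is not None:
        layers_per_stage = num_layers // pipeline_model_parallel_size
        assert layers_per_stage % num_layers_per_virtual_pipeline_stage == 0, (
            f"layers per pipeline stage ({layers_per_stage}) must be divisible "
            f"by num_layers_per_virtual_pipeline_stage "
            f"({num_layers_per_virtual_pipeline_stage})"
        )
```

With this ordering, `check_layer_split(25, 4)` fails immediately instead of silently producing an uneven layer split when the virtual-stage argument is unset.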
**Issue Title** Megatron-LM: Zero-1 with Distributed Optimizer Showing No Overlap in Communication and Computation **Issue Description** We are experiencing an issue with Megatron-LM where enabling zero-1 (`--overlap-grad-reduce --overlap-param-gather`) along with...