
[BUG] Performance drop while training with MoE

Open Teng-xu opened this issue 11 months ago • 8 comments

Describe the bug
During training runs that use Megatron's Mixture of Experts (MoE) layers, we observed a drop in throughput at specific steps, with the slowdown appearing sporadically and inconsistently throughout training. We have also done some profiling and found that execution time during the low-performance steps is dominated by the all-gather and reduce-scatter calls, which account for 99% of that time. We would appreciate insight into potential causes of this performance issue.
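
One way to capture this kind of per-step breakdown is a torch.profiler trace. A minimal sketch (the Linear model below is only a stand-in for the real distributed training step):

import torch
from torch.profiler import profile, ProfilerActivity, schedule

# Stand-in model and data; in the real run this would be the Megatron training step.
model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=4),
) as prof:
    for step in range(6):
        x = torch.randn(8, 4096, device="cuda")
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        prof.step()

# Sorted by CUDA time; in the slow steps described above, the NCCL
# all_gather / reduce_scatter kernels dominate this table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))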

To Reproduce

# Reproduction config; imports added for completeness (paths assume Megatron-LM core)
import torch.nn.functional as F
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    tensor_model_parallel_size=1, context_parallel_size=1, pipeline_model_parallel_size=1,
    expert_model_parallel_size=2, num_layers=32, hidden_size=4096,
    num_attention_heads=32, layernorm_epsilon=1e-05,
    add_bias_linear=False, activation_func=F.silu, num_moe_experts=8,
    fp8=None, normalization='RMSNorm', moe_router_load_balancing_type='sinkhorn',
    moe_router_topk=2, moe_grouped_gemm=True, moe_aux_loss_coeff=0.0,
    moe_z_loss_coeff=None, moe_input_jitter_eps=None, moe_token_dropping=False
)

Expected behavior
The throughput should be stable across training steps.

Stack trace/logs

Batch 138 Loss: 6.425852298736572, Speed: 6.26 samples/sec, Model TFLOPS/GPU: 320.16
Batch 139 Loss: 6.429311275482178, Speed: 6.30 samples/sec, Model TFLOPS/GPU: 321.91
Batch 140 Loss: 6.296842575073242, Speed: 0.05 samples/sec, Model TFLOPS/GPU: 2.39
Batch 141 Loss: 6.297295570373535, Speed: 0.26 samples/sec, Model TFLOPS/GPU: 13.24

[attached screenshot]

Environment (please complete the following information):

  • Megatron-LM commit ID: 5f9c870f9f24b482509699d206a9dbb00958f6fc
  • PyTorch version: PT-2.1
  • CUDA version: CUDA-12.1
  • NCCL version: 2.18.3

Teng-xu · Feb 29, 2024

I also encountered the same problem with a non-MoE model. I tried to run a Llama 13B training job on two DGX A100 nodes, and the time breakdown shows:

    forward-backward ...............................: (5662.82, 5666.72)
    forward-compute ................................: (2146.30, 2210.56)
    backward-compute ...............................: (3431.20, 3509.58)
    batch-generator ................................: (17.31, 33.45)
    layernorm-grads-all-reduce .....................: (5.24, 218.94)
    embedding-grads-all-reduce .....................: (0.06, 0.11)
    all-grads-sync .................................: (215891.91, 225072.22)
    optimizer-copy-to-main-grad ....................: (9.13, 9.19)
    optimizer-unscale-and-check-inf ................: (9.69, 9.88)
    optimizer-clip-main-grad .......................: (14.55, 14.77)
    optimizer-count-zeros ..........................: (0.02, 0.07)
    optimizer-inner-step ...........................: (31.58, 32.33)
    optimizer-copy-main-to-model-params ............: (9.36, 9.57)
    optimizer ......................................: (77.15, 77.37)

(I disabled all overlap-* optimizations and the distributed optimizer for a more accurate time breakdown.) Gradient all-reduce takes more than 200 seconds while forward-backward takes just 5.6 seconds. The problem occurs regardless of whether the distributed optimizer is used:

    forward-backward ...............................: (6640.79, 6647.08)
    forward-compute ................................: (3118.90, 3181.81)
    backward-compute ...............................: (3428.96, 3512.83)
    batch-generator ................................: (16.72, 34.26)
    layernorm-grads-all-reduce .....................: (4.97, 11.08)
    embedding-grads-all-reduce .....................: (0.06, 0.12)
    all-grads-sync .................................: (77025.69, 112368.28)
    params-all-gather ..............................: (77461.61, 112343.53)
    optimizer-copy-to-main-grad ....................: (4.65, 4.82)
    optimizer-unscale-and-check-inf ................: (5.37, 5.39)
    optimizer-clip-main-grad .......................: (7.70, 7.74)
    optimizer-count-zeros ..........................: (0.02, 0.03)
    optimizer-inner-step ...........................: (15.89, 16.30)
    optimizer-copy-main-to-model-params ............: (4.53, 4.56)
    optimizer ......................................: (77502.36, 112384.28)

When I run the same job on a single node, the problem disappears.
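
A quick way to check whether inter-node NCCL is healthy in this situation is to time a bare all-reduce across the two nodes. A minimal sketch (assuming the script is launched with torchrun, which sets LOCAL_RANK):

import os
import time
import torch
import torch.distributed as dist

# Run with: torchrun --nnodes=2 --nproc_per_node=8 ... this_script.py
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

x = torch.randn(64 * 1024 * 1024, device="cuda")  # 256 MiB of fp32
for _ in range(5):                                 # warmup
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.time()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.time() - t0) / iters

if dist.get_rank() == 0:
    # If a 256 MiB all-reduce takes hundreds of milliseconds on two DGX nodes,
    # NCCL is likely falling back to TCP instead of using InfiniBand.
    print(f"all_reduce of 256 MiB: {elapsed * 1e3:.2f} ms, "
          f"{x.numel() * 4 / elapsed / 1e9:.1f} GB/s effective")

dist.destroy_process_group()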

My environment is:

  • Megatron commit ID: core_v0.4.0
  • NGC PyTorch container version: 23.04

ktaebum · Feb 29, 2024

My issue was resolved by passing --device=/dev/infiniband as a docker run argument.

ktaebum · Feb 29, 2024

ktaebum's issue is unrelated. We only notice a slowdown at some steps, and it is due to intra-node AllGather calls whose times are surprisingly high for those steps.

rahul003 · Mar 1, 2024

I have encountered the same problem with MoE when the router type is sinkhorn and topk > 1.

[attached screenshot]

From my logs, I found that most of the time is spent in the sinkhorn function:

norm_logits = sinkhorn(logits.to(dtype=torch.float32))

dawson-chen · Mar 25, 2024

When topk > 1 and the router type is sinkhorn, the inner loop of the sinkhorn function runs thousands of times for some logits. But I haven't found any clue in those logits; they look similar to normal ones.

dawson-chen · Mar 27, 2024

@Teng-xu @dawson-chen Thanks for reporting this issue. This could be due to too many iterations in Sinkhorn on some ranks. You can try adding an early stop to Sinkhorn or using aux_loss for load balancing.
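
A rough sketch of what such an early stop could look like (illustrative only, not the actual Megatron routing code; the tol and max_iters names are assumptions):

import torch

def sinkhorn_with_early_stop(logits: torch.Tensor, tol: float = 1e-4, max_iters: int = 100):
    # Sinkhorn normalization with a tolerance-based early stop and a hard
    # iteration cap, so pathological logits cannot loop for thousands of iterations.
    cost = torch.exp(logits.to(torch.float32))
    d0 = torch.ones(cost.size(0), device=cost.device)
    d1 = torch.ones(cost.size(1), device=cost.device)
    eps = 1e-8
    for it in range(max_iters):
        d1_old = d1
        d0 = 1.0 / (cost.size(0) * (torch.sum(d1 * cost, dim=1) + eps))
        d1 = 1.0 / (cost.size(1) * (torch.sum(d0.unsqueeze(1) * cost, dim=0) + eps))
        if torch.mean(torch.abs(d1 - d1_old)) < tol:
            break
    return d1 * cost * d0.unsqueeze(1), it + 1

# Returning the iteration count also makes it easy to log how often the cap is hit:
probs, n_iters = sinkhorn_with_early_stop(torch.randn(4096, 8))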

yanring · Apr 5, 2024

How do I get Model TFLOPS/GPU?
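
For reference, is it an analytic estimate along these lines? A sketch of the usual dense-GPT formula (the function and argument names are my own; MoE/top-k and activation-recompute corrections are ignored):

def model_tflops_per_gpu(batch_size, seq_len, num_layers, hidden_size,
                         vocab_size, iter_time_s, num_gpus):
    # Model FLOPs per iteration: forward + backward ~= 3x forward,
    # attention + 4*hidden MLP per layer, plus the output logit layer.
    flops_per_iter = (
        72 * batch_size * seq_len * num_layers * hidden_size ** 2
        * (1 + seq_len / (6 * hidden_size)
             + vocab_size / (12 * num_layers * hidden_size))
    )
    return flops_per_iter / (iter_time_s * num_gpus) / 1e12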

wen020 · Jun 4, 2024

Marking as stale. No activity in 60 days.

github-actions[bot] · Aug 3, 2024