[BUG] --cuda-graph-scope attn and --external-cuda-graph
Describe the bug
Phase 1. The mixtral8*7b MoE model has runtime errors when --cuda-graph-scope attn and --external-cuda-graph are enabled.
configs:
- 8 * H100
- pipeline-model-parallel-size 8
- seq-length 4096
- ffn-hidden-size 14336
- num-layers 16
- hidden-size 4096
- num-attention-heads 32
- group-query-attention
- num-query-groups 8
- num-experts 8
- moe-router-topk 2
- mbs/gbs 1/128
To avoid the name clash between CudaRNGStatesTracker in Megatron-LM and CudaRNGStatesTracker in TE, I commented out some code in TE.
Phase 2. After trying to turn on --te-rng-tracker, CUDA out of memory occurs.
Phase 3. After adjusting mbs/gbs from 1/128 to 1/20, a RuntimeError occurs.
With --cuda-graph-scope attn, shouldn't the MoE part be unaffected?
To Reproduce
Enable --cuda-graph-scope attn and --external-cuda-graph, and run the Mixtral 8*7B MoE model.
Expected behavior No bugs
Stack trace/logs
Environment (please complete the following information):
- Megatron-LM commit ID : cbc89b322c454a2de46edcbd1fc708669aeafd59
- PyTorch version : 2.7.0a0+7c8ec84dab.nv25.3
- CUDA version: 12.8.1.012
- NCCL version: 2.25.1
- NGC version: 25.03
Proposed fix N/A
Additional context N/A
As of today, --external-cuda-graph must go with --te-rng-tracker. I suspect your phase 3 error is still strange behavior caused by OOM. Could you run some mini tests first, such as running with only one layer per GPU, and see whether it can run or not? If it's truly an OOM issue, you can try merging this PR into your TE. It saves a lot of the memory overhead caused by cudagraph.
I'm not sure whether a single layer has the problem; I can try this PR. Can you try the Mixtral 8*7B MoE model? What MoE model configuration have you tested?
These are my arguments running 8*7b cudagraph. But I tested with 4 nodes: --position-embedding-type rope --normalization RMSNorm --swiglu --no-position-embedding --no-masked-softmax-fusion --tokenizer-type Llama2Tokenizer --tokenizer-model xxxxx/mixtral-tokenizer.model --ffn-hidden-size 14336 --group-query-attention --num-query-groups 8 --num-layers 32 --hidden-size 4096 --num-attention-heads 32 --seq-length 4096 --max-position-embeddings 4096 --use-flash-attn --untie-embeddings-and-output-weights --disable-bias-linear --attention-dropout 0.0 --hidden-dropout 0.0 --micro-batch-size 1 --global-batch-size 128 --train-samples 268554688 --lr-decay-samples 255126953 --lr-warmup-samples 162761 --lr 1.2e-4 --min-lr 1.2e-5 --lr-decay-style cosine --clip-grad 1.0 --weight-decay 0.1 --adam-beta1 0.9 --adam-beta2 0.95 --init-method-std 0.008 --bf16 --use-mcore-models --transformer-impl transformer_engine --overlap-grad-reduce --overlap-param-gather --external-cuda-graph --te-rng-tracker --cuda-graph-scope attn --fp8-format hybrid --fp8-amax-history-len 1024 --fp8-amax-compute-algo max --tensor-model-parallel-size 2 --pipeline-model-parallel-size 4 --context-parallel-size 1 --sequence-parallel --use-distributed-optimizer --num-layers-per-virtual-pipeline-stage 1 --data-path xxxxx --data-cache-path xxxxx --split 99,1,0 --log-interval 10 --save-interval 10000 --eval-interval 200 --eval-iters 32 --tensorboard-dir xxxxx --tensorboard-queue-size 100 --log-throughput --log-timers-to-tensorboard --log-validation-ppl-to-tensorboard --log-num-zeros-in-grad --distributed-timeout-minutes 6000 --exit-duration-in-mins 230 --save xxxxx --load xxxxx --num-experts 8 --expert-model-parallel-size 4 --moe-router-load-balancing-type aux_loss --moe-router-topk 2 --moe-aux-loss-coeff 1e-2 --moe-token-dispatcher-type alltoall
With the above TE PR, I observed nearly no memory growth compared with no cudagraph.
I used the above PR, but the problem was not solved. I observed that GPU memory kept growing and eventually hit OOM. Does io_memory_reduction need to be turned on? I didn't see that option in the configuration you posted above, but looking at the code logic, it needs to be added.
When I used io_memory_reduction = True, the following error occurred.
configs:
- 8 * H100
- pipeline-model-parallel-size 8
- seq-length 4096
- max-position-embeddings 4096
- ffn-hidden-size 14336
- num-layers 16
- hidden-size 4096
- num-attention-heads 32
- group-query-attention
- num-query-groups 8
- num-experts 8
- moe-router-topk 2
- mbs/gbs 1/128
- bf16
- cuda-graph-scope attn
- external-cuda-graph
- te-rng-tracker
- NGC-24.08
Correct, you need to pass io_memory_reduction = True to make_graphed_callables to enable it.
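For illustration, here is a minimal sketch of how that keyword might be passed at the graph-capture site, assuming the cudagraph_reuse TE branch adds io_memory_reduction to make_graphed_callables (mainline TE does not have it); the helper function, module list, and tensor shapes below are hypothetical and not Megatron-LM's actual code:

```python
# Illustrative sketch only: passing io_memory_reduction=True when capturing CUDA graphs.
# The keyword is assumed to exist only on the cudagraph_reuse branch of
# TransformerEngine; everything else here is standard TE usage with made-up shapes.
import torch
import transformer_engine.pytorch as te

def capture_attn_graphs(attn_modules, seq_len, micro_batch_size, hidden_size):
    # One static sample input per callable to be captured (shapes are illustrative).
    sample_args = tuple(
        (torch.zeros(seq_len, micro_batch_size, hidden_size,
                     dtype=torch.bfloat16, device="cuda", requires_grad=True),)
        for _ in attn_modules
    )
    return te.make_graphed_callables(
        tuple(attn_modules),
        sample_args,
        num_warmup_iters=3,
        io_memory_reduction=True,  # assumed kwarg from the cudagraph_reuse branch
    )
```

In Megatron-LM the flag would have to be threaded through to wherever make_graphed_callables is invoked on the --external-cuda-graph path.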
Your error seems very strange; I cannot think of a reason why the old and new data pointers would mismatch... I suspect it has something to do with the PyTorch version. I'm using NGC 25.02; do you mind switching to this version and giving it a try?
I may also try with your single-node config this week.
Hi @buptzyb, I used io_memory_reduction = True. Megatron commit: 07101375c8a824cc1c4e61848f24f1ac4840b23b, TE commit: 4c39e40fc00f2120a781a4892c6043f9e89c2033.
TE: git clone https://github.com/buptzyb/TransformerEngine.git and check out the cudagraph_reuse branch.
NGC-25.05 and 8 * H100
MODEL_ARGS=(
--use-mcore-models
--transformer-impl "transformer_engine"
--disable-bias-linear
--seq-length 4096
--max-position-embeddings 4096
--ffn-hidden-size 4096 #4h
--num-layers 72
--hidden-size 1024
--num-attention-heads 64
--group-query-attention
--num-query-groups 8
--init-method-std 0.008
--attention-dropout 0.0
--hidden-dropout 0.0
--normalization RMSNorm
--norm-epsilon 1e-5
--untie-embeddings-and-output-weights
--position-embedding-type rope
--rotary-percent 1.0
--swiglu
--no-masked-softmax-fusion
--no-position-embedding
--use-flash-attn
--overlap-grad-reduce
--overlap-param-gather
--ckpt-format torch_dist
--te-rng-tracker
--cuda-graph-scope attn
--external-cuda-graph
)
MOE_ARGS=(
--num-experts 8
--expert-model-parallel-size 1
--expert-tensor-parallel-size 1
--moe-router-load-balancing-type aux_loss # options: aux_loss, sinkhorn, None. Default is aux_loss.
--moe-router-topk 4
--moe-grouped-gemm
--moe-aux-loss-coeff 1e-2
--moe-z-loss-coeff 1e-3
--moe-token-dispatcher-type alltoall
)
DATA_ARGS=(
--data-path $DATA_PATH
--split 100,0,0
--tokenizer-type Llama2Tokenizer
--tokenizer-model ${TOKENIZER_MODEL_PATH}
--data-cache-path $DATA_CACHE_PATH
)
TRAINING_ARGS=(
--micro-batch-size 1
--global-batch-size 128
--lr 2.6e-4
--train-iters 1000
--lr-decay-iters 1000
--lr-decay-style cosine
--min-lr 2.6e-5
--weight-decay 0.1
--lr-warmup-iters 200
--clip-grad 1.0
--bf16
--adam-beta1 0.9
--adam-beta2 0.95
--adam-eps 1e-8
)
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 8
--use-distributed-optimizer
--sequence-parallel
)
OOM occurs when using --te-rng-tracker, --cuda-graph-scope attn, --external-cuda-graph.
With --te-rng-tracker, --cuda-graph-scope attn, --external-cuda-graph turned off, training runs normally. mem: 58.151GB
Hi @Baibaifan, I tested with your configuration on my side, and everything works fine... Here is the memory log; cudagraph (orange) takes about 2GB more memory than non-cudagraph (green):
And here is the throughput log:
So I'm not sure what's wrong on your side... I'm on a local Megatron branch that is several commits ahead of yours, but I don't think that has anything to do with the cudagraph memory optimization... Maybe you can print more information inside TE's graph.py to check whether io_memory_reduction is really enabled? Or do you have your wandb information?
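For example, one quick way to confirm that the installed TE really carries the option (a sketch, assuming the cudagraph_reuse fork is what should be installed):

```python
# Minimal sanity check (a sketch): verify that the installed TransformerEngine
# actually exposes the io_memory_reduction keyword from the cudagraph_reuse branch.
import inspect
import transformer_engine.pytorch as te

params = inspect.signature(te.make_graphed_callables).parameters
if "io_memory_reduction" not in params:
    raise RuntimeError(
        "Installed TE does not support io_memory_reduction; "
        "the cudagraph_reuse branch may not be the TE that is actually installed."
    )
```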
To double check, here is my command:
torchrun --standalone --nnodes=1 --nproc-per-node=8 xxx/megatron-lm/pretrain_gpt.py --use-mcore-models --transformer-impl transformer_engine --disable-bias-linear --seq-length 4096 --max-position-embeddings 4096 --ffn-hidden-size 4096 --num-layers 72 --hidden-size 1024 --num-attention-heads 64 --group-query-attention --num-query-groups 8 --init-method-std 0.008 --attention-dropout 0.0 --hidden-dropout 0.0 --normalization RMSNorm --norm-epsilon 1e-5 --untie-embeddings-and-output-weights --position-embedding-type rope --rotary-percent 1.0 --swiglu --no-masked-softmax-fusion --no-position-embedding --use-flash-attn --overlap-grad-reduce --overlap-param-gather --ckpt-format torch_dist --te-rng-tracker --cuda-graph-scope attn --external-cuda-graph --num-experts 8 --expert-model-parallel-size 1 --expert-tensor-parallel-size 1 --moe-router-load-balancing-type aux_loss --moe-router-topk 4 --moe-grouped-gemm --moe-aux-loss-coeff 1e-2 --moe-z-loss-coeff 1e-3 --moe-token-dispatcher-type alltoall --data-path xxx/datasets/wudao_mistralbpe_content_document --data-cache-path xxx/baibaifan/cache --split 99,1,0 --tokenizer-type Llama2Tokenizer --tokenizer-model xxx/mixtral-tokenizer.model --micro-batch-size 1 --global-batch-size 128 --lr 2.6e-4 --train-iters 1000 --lr-decay-iters 1000 --lr-decay-style cosine --min-lr 2.6e-5 --weight-decay 0.1 --lr-warmup-iters 200 --clip-grad 1.0 --bf16 --adam-beta1 0.9 --adam-beta2 0.95 --adam-eps 1e-8 --tensor-model-parallel-size 1 --pipeline-model-parallel-size 8 --use-distributed-optimizer --sequence-parallel --log-interval 10 --save-interval 10000 --eval-interval 2000 --eval-iters 32 --tensorboard-dir xxx/baibaifan/tensorboard/interactive_test_PP8EP1TP1CP1VPP1 --tensorboard-queue-size 100 --log-throughput --log-timers-to-tensorboard --log-validation-ppl-to-tensorboard --log-num-zeros-in-grad --distributed-timeout-minutes 6000 --exit-duration-in-mins 230 --wandb-project baibaifan --wandb-exp-name interactive_test_PP8EP1TP1CP1VPP1_cg --wandb-save-dir xxx/baibaifan/wandb/interactive_test_PP8EP1TP1CP1VPP1 --save xxx/baibaifan/checkpoints/interactive_test_PP8EP1TP1CP1VPP1 --load xxx/baibaifan/checkpoints/interactive_test_PP8EP1TP1CP1VPP1
Hi @buptzyb, the OOM problem has been solved; it was a problem with the TE installation. However, the performance of NGC-25.05 on 8 * H100 with cuda graph turned on is not as good as that of NGC-24.08 on 8 * H100 without cuda graph.
NGC-25.05 and 8 * H100 (no drop token):
- no cuda graph: TFLOPS: 115.5, mem: 56.587GB
- Turning on --te-rng-tracker, --cuda-graph-scope attn, --external-cuda-graph, io_memory_reduction=True: runs normally. TFLOPS: 132.4, mem: 57.333GB
NGC-24.08 and 8 * H100 (no drop token):
- no cuda graph: TFLOPS: 155.7, mem: 56.057GB
- Turning on --te-rng-tracker, --cuda-graph-scope attn, --external-cuda-graph, io_memory_reduction=True: the following error occurred.
MODEL_ARGS=(
--use-mcore-models
--transformer-impl "transformer_engine"
--disable-bias-linear
--seq-length 4096
--max-position-embeddings 4096
--ffn-hidden-size 4096 #4h
--num-layers 72
--hidden-size 1024
--num-attention-heads 64
--group-query-attention
--num-query-groups 8
--init-method-std 0.008
--attention-dropout 0.0
--hidden-dropout 0.0
--normalization RMSNorm
--norm-epsilon 1e-5
--untie-embeddings-and-output-weights
--position-embedding-type rope
--rotary-percent 1.0
--swiglu
--no-masked-softmax-fusion
--no-position-embedding
--use-flash-attn
--overlap-grad-reduce
--overlap-param-gather
--ckpt-format torch_dist
--te-rng-tracker
--cuda-graph-scope attn
--external-cuda-graph
)
MOE_ARGS=(
--num-experts 8
--expert-model-parallel-size 1
--expert-tensor-parallel-size 1
--moe-router-load-balancing-type aux_loss # options: aux_loss, sinkhorn, None. Default is aux_loss.
--moe-router-topk 4
--moe-grouped-gemm
--moe-aux-loss-coeff 1e-2
--moe-z-loss-coeff 1e-3
--moe-token-dispatcher-type alltoall
)
DATA_ARGS=(
--data-path $DATA_PATH
--split 100,0,0
--tokenizer-type Llama2Tokenizer
--tokenizer-model ${TOKENIZER_MODEL_PATH}
--data-cache-path $DATA_CACHE_PATH
)
TRAINING_ARGS=(
--micro-batch-size 1
--global-batch-size 128
--lr 2.6e-4
--train-iters 1000
--lr-decay-iters 1000
--lr-decay-style cosine
--min-lr 2.6e-5
--weight-decay 0.1
--lr-warmup-iters 200
--clip-grad 1.0
--bf16
--adam-beta1 0.9
--adam-beta2 0.95
--adam-eps 1e-8
)
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 8
--use-distributed-optimizer
--sequence-parallel
)
What's the throughput once the MoE balance loss is low enough? If you only compare throughput at the very first steps, the numbers may be misleading.
For the same strategy, even in the initial stage, will there be a big difference if different NGC images are used?
There is a problem with the NGC-24.08 image using cuda graph, which needs to be resolved.
I made some tests and found that this problem exists in <=24.09; it runs normally in 24.10. So I suspect a PyTorch commit between 24.09 and 24.10 added this support. If you'd like to stick to 24.08, you'll have to find that commit and cherry-pick it. I counted 1321 commits between 24.09 and 24.10, so bisection can find the target in about 10-11 attempts.
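For reference, the attempt count is just binary search over the commit range; a tiny illustrative check:

```python
# Binary search over N candidate commits needs about ceil(log2(N)) test builds
# in the worst case; for the 1321 commits between the 24.09 and 24.10 containers:
import math

print(math.ceil(math.log2(1321)))  # -> 11, i.e. roughly 10-11 attempts
```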
Oh, an easier way is to remove the two make_weak_ref calls in graph.py, e.g. replacing per_callable_static_outputs[per_callable_bwd_idx] = make_weak_ref(static_outputs) with per_callable_static_outputs[per_callable_bwd_idx] = static_outputs. This lets you completely avoid the problem path, but it also leads to extra memory usage.
OK, thanks.
Closing this as the issue seems to be resolved.