[BUG] --cuda-graph-scope attn and --external-cuda-graph
Describe the bug
Phase 1. The mixtral8*7b MoE model has runtime errors when --cuda-graph-scope attn and --external-cuda-graph are enabled.
configs:
- 8 * H100
- pipeline-model-parallel-size 8
- seq-length 4096
- ffn-hidden-size 14336
- num-layers 16
- hidden-size 4096
- num-attention-heads 32
- group-query-attention
- num-query-groups 8
- num-experts 8
- moe-router-topk 2
- mbs/gbs 1/128
To avoid the name clash between CudaRNGStatesTracker in Megatron-LM and CudaRNGStatesTracker in TE, I commented out some code in TE.
Phase 2. After trying to turn on --te-rng-tracker, CUDA out of memory occurs.
Phase 3. After adjusting mbs/gbs from 1/128 to 1/20, a RuntimeError occurs.
With --cuda-graph-scope attn, shouldn't the MoE part be unaffected?
To Reproduce
Enable --cuda-graph-scope attn and --external-cuda-graph, and run the Mixtral 8*7B MoE model.
Expected behavior No bugs
Stack trace/logs
Environment (please complete the following information):
- Megatron-LM commit ID : cbc89b322c454a2de46edcbd1fc708669aeafd59
- PyTorch version : 2.7.0a0+7c8ec84dab.nv25.3
- CUDA version: 12.8.1.012
- NCCL version: 2.25.1
- NGC version: 25.03
Proposed fix N/A
Additional context N/A
As of today, --external-cuda-graph must go with --te-rng-tracker. I suspect your phase 3 error is still strange behavior caused by OOM. Could you run some mini tests first, such as running with only one layer per GPU, and see whether it can run or not? If it's truly an OOM issue, you can try merging this PR into your TE. It saves a lot of the memory overhead caused by cudagraph.
I'm not sure whether a single layer has the problem; I can try this PR. Can you try the Mixtral 8*7B MoE model? What MoE model configuration have you tested?
These are my arguments running 8*7b cudagraph. But I tested with 4 nodes: --position-embedding-type rope --normalization RMSNorm --swiglu --no-position-embedding --no-masked-softmax-fusion --tokenizer-type Llama2Tokenizer --tokenizer-model xxxxx/mixtral-tokenizer.model --ffn-hidden-size 14336 --group-query-attention --num-query-groups 8 --num-layers 32 --hidden-size 4096 --num-attention-heads 32 --seq-length 4096 --max-position-embeddings 4096 --use-flash-attn --untie-embeddings-and-output-weights --disable-bias-linear --attention-dropout 0.0 --hidden-dropout 0.0 --micro-batch-size 1 --global-batch-size 128 --train-samples 268554688 --lr-decay-samples 255126953 --lr-warmup-samples 162761 --lr 1.2e-4 --min-lr 1.2e-5 --lr-decay-style cosine --clip-grad 1.0 --weight-decay 0.1 --adam-beta1 0.9 --adam-beta2 0.95 --init-method-std 0.008 --bf16 --use-mcore-models --transformer-impl transformer_engine --overlap-grad-reduce --overlap-param-gather --external-cuda-graph --te-rng-tracker --cuda-graph-scope attn --fp8-format hybrid --fp8-amax-history-len 1024 --fp8-amax-compute-algo max --tensor-model-parallel-size 2 --pipeline-model-parallel-size 4 --context-parallel-size 1 --sequence-parallel --use-distributed-optimizer --num-layers-per-virtual-pipeline-stage 1 --data-path xxxxx --data-cache-path xxxxx --split 99,1,0 --log-interval 10 --save-interval 10000 --eval-interval 200 --eval-iters 32 --tensorboard-dir xxxxx --tensorboard-queue-size 100 --log-throughput --log-timers-to-tensorboard --log-validation-ppl-to-tensorboard --log-num-zeros-in-grad --distributed-timeout-minutes 6000 --exit-duration-in-mins 230 --save xxxxx --load xxxxx --num-experts 8 --expert-model-parallel-size 4 --moe-router-load-balancing-type aux_loss --moe-router-topk 2 --moe-aux-loss-coeff 1e-2 --moe-token-dispatcher-type alltoall
With the above TE PR, I observed nearly no memory growth compared with no cudagraph.
I used the above PR, but the problem was not solved. I observed that GPU memory kept growing and eventually hit OOM. Does io_memory_reduction need to be turned on? I didn't see that option in the configuration you posted above, but looking at the code logic, it needs to be added.
When I used io_memory_reduction = True, the following error occurred.
configs:
- 8 * H100
- pipeline-model-parallel-size 8
- seq-length 4096
- max-position-embeddings 4096
- ffn-hidden-size 14336
- num-layers 16
- hidden-size 4096
- num-attention-heads 32
- group-query-attention
- num-query-groups 8
- num-experts 8
- moe-router-topk 2
- mbs/gbs 1/128
- bf16
- cuda-graph-scope attn
- external-cuda-graph
- te-rng-tracker
- NGC-24.08
Correct, you need to pass io_memory_reduction = True to make_graphed_callables to enable it.
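For illustration, here is a minimal sketch of how that keyword might be passed at the graph-capture site, assuming the cudagraph_reuse TE branch adds io_memory_reduction to make_graphed_callables (mainline TE does not have it); the helper function, module list, and tensor shapes below are hypothetical and not Megatron-LM's actual code:

```python
# Illustrative sketch only: passing io_memory_reduction=True when capturing CUDA graphs.
# The keyword is assumed to exist only on the cudagraph_reuse branch of
# TransformerEngine; everything else here is standard TE usage with made-up shapes.
import torch
import transformer_engine.pytorch as te

def capture_attn_graphs(attn_modules, seq_len, micro_batch_size, hidden_size):
    # One static sample input per callable to be captured (shapes are illustrative).
    sample_args = tuple(
        (torch.zeros(seq_len, micro_batch_size, hidden_size,
                     dtype=torch.bfloat16, device="cuda", requires_grad=True),)
        for _ in attn_modules
    )
    return te.make_graphed_callables(
        tuple(attn_modules),
        sample_args,
        num_warmup_iters=3,
        io_memory_reduction=True,  # assumed kwarg from the cudagraph_reuse branch
    )
```

In Megatron-LM the flag would have to be threaded through to wherever make_graphed_callables is invoked on the --external-cuda-graph path.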
Your error seems very strange; I cannot think of a reason why the old and new data pointers would mismatch... I suspect it has something to do with the PyTorch version. I'm using NGC 25.02; do you mind switching to this version and giving it a try?
I may also try with your single-node config this week.
Hi @buptzyb, I used io_memory_reduction = True. Megatron commit: 07101375c8a824cc1c4e61848f24f1ac4840b23b, TE commit: 4c39e40fc00f2120a781a4892c6043f9e89c2033.
TE: git clone https://github.com/buptzyb/TransformerEngine.git and check out the cudagraph_reuse branch.
NGC-25.05 and 8 * H100
MODEL_ARGS=(
--use-mcore-models
--transformer-impl "transformer_engine"
--disable-bias-linear
--seq-length 4096
--max-position-embeddings 4096
--ffn-hidden-size 4096 #4h
--num-layers 72
--hidden-size 1024
--num-attention-heads 64
--group-query-attention
--num-query-groups 8
--init-method-std 0.008
--attention-dropout 0.0
--hidden-dropout 0.0
--normalization RMSNorm
--norm-epsilon 1e-5
--untie-embeddings-and-output-weights
--position-embedding-type rope
--rotary-percent 1.0
--swiglu
--no-masked-softmax-fusion
--no-position-embedding
--use-flash-attn
--overlap-grad-reduce
--overlap-param-gather
--ckpt-format torch_dist
--te-rng-tracker
--cuda-graph-scope attn
--external-cuda-graph
)
MOE_ARGS=(
--num-experts 8
--expert-model-parallel-size 1
--expert-tensor-parallel-size 1
--moe-router-load-balancing-type aux_loss # options: aux_loss, sinkhorn, None. Default is aux_loss.
--moe-router-topk 4
--moe-grouped-gemm
--moe-aux-loss-coeff 1e-2
--moe-z-loss-coeff 1e-3
--moe-token-dispatcher-type alltoall
)
DATA_ARGS=(
--data-path $DATA_PATH
--split 100,0,0
--tokenizer-type Llama2Tokenizer
--tokenizer-model ${TOKENIZER_MODEL_PATH}
--data-cache-path $DATA_CACHE_PATH
)
TRAINING_ARGS=(
--micro-batch-size 1
--global-batch-size 128
--lr 2.6e-4
--train-iters 1000
--lr-decay-iters 1000
--lr-decay-style cosine
--min-lr 2.6e-5
--weight-decay 0.1
--lr-warmup-iters 200
--clip-grad 1.0
--bf16
--adam-beta1 0.9
--adam-beta2 0.95
--adam-eps 1e-8
)
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 8
--use-distributed-optimizer
--sequence-parallel
)
OOM occurs when using --te-rng-tracker, --cuda-graph-scope attn, --external-cuda-graph.
With --te-rng-tracker, --cuda-graph-scope attn, --external-cuda-graph turned off, training runs normally. mem: 58.151GB
Hi @Baibaifan, I tested with your configuration on my side, and everything works fine... Here is the memory log; cudagraph (orange) takes about 2GB more memory than non-cudagraph (green):
And here is the throughput log:
So I'm not sure what's wrong on your side... I'm on a local Megatron branch that is several commits ahead of yours, but I don't think that has anything to do with the cudagraph memory optimization... Maybe you can print more information inside TE's graph.py to check whether io_memory_reduction is really enabled? Or do you have your wandb information?
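For example, one quick way to confirm that the installed TE really carries the option (a sketch, assuming the cudagraph_reuse fork is what should be installed):

```python
# Minimal sanity check (a sketch): verify that the installed TransformerEngine
# actually exposes the io_memory_reduction keyword from the cudagraph_reuse branch.
import inspect
import transformer_engine.pytorch as te

params = inspect.signature(te.make_graphed_callables).parameters
if "io_memory_reduction" not in params:
    raise RuntimeError(
        "Installed TE does not support io_memory_reduction; "
        "the cudagraph_reuse branch may not be the TE that is actually installed."
    )
```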
To double check, here is my command:
torchrun --standalone --nnodes=1 --nproc-per-node=8 xxx/megatron-lm/pretrain_gpt.py --use-mcore-models --transformer-impl transformer_engine --disable-bias-linear --seq-length 4096 --max-position-embeddings 4096 --ffn-hidden-size 4096 --num-layers 72 --hidden-size 1024 --num-attention-heads 64 --group-query-attention --num-query-groups 8 --init-method-std 0.008 --attention-dropout 0.0 --hidden-dropout 0.0 --normalization RMSNorm --norm-epsilon 1e-5 --untie-embeddings-and-output-weights --position-embedding-type rope --rotary-percent 1.0 --swiglu --no-masked-softmax-fusion --no-position-embedding --use-flash-attn --overlap-grad-reduce --overlap-param-gather --ckpt-format torch_dist --te-rng-tracker --cuda-graph-scope attn --external-cuda-graph --num-experts 8 --expert-model-parallel-size 1 --expert-tensor-parallel-size 1 --moe-router-load-balancing-type aux_loss --moe-router-topk 4 --moe-grouped-gemm --moe-aux-loss-coeff 1e-2 --moe-z-loss-coeff 1e-3 --moe-token-dispatcher-type alltoall --data-path xxx/datasets/wudao_mistralbpe_content_document --data-cache-path xxx/baibaifan/cache --split 99,1,0 --tokenizer-type Llama2Tokenizer --tokenizer-model xxx/mixtral-tokenizer.model --micro-batch-size 1 --global-batch-size 128 --lr 2.6e-4 --train-iters 1000 --lr-decay-iters 1000 --lr-decay-style cosine --min-lr 2.6e-5 --weight-decay 0.1 --lr-warmup-iters 200 --clip-grad 1.0 --bf16 --adam-beta1 0.9 --adam-beta2 0.95 --adam-eps 1e-8 --tensor-model-parallel-size 1 --pipeline-model-parallel-size 8 --use-distributed-optimizer --sequence-parallel --log-interval 10 --save-interval 10000 --eval-interval 2000 --eval-iters 32 --tensorboard-dir xxx/baibaifan/tensorboard/interactive_test_PP8EP1TP1CP1VPP1 --tensorboard-queue-size 100 --log-throughput --log-timers-to-tensorboard --log-validation-ppl-to-tensorboard --log-num-zeros-in-grad --distributed-timeout-minutes 6000 --exit-duration-in-mins 230 --wandb-project baibaifan --wandb-exp-name interactive_test_PP8EP1TP1CP1VPP1_cg --wandb-save-dir xxx/baibaifan/wandb/interactive_test_PP8EP1TP1CP1VPP1 --save xxx/baibaifan/checkpoints/interactive_test_PP8EP1TP1CP1VPP1 --load xxx/baibaifan/checkpoints/interactive_test_PP8EP1TP1CP1VPP1
Hi @buptzyb, the OOM problem has been solved; it was a problem with the TE installation. However, the performance of NGC-25.05 on 8 * H100 with cuda graph turned on is not as good as that of NGC-24.08 on 8 * H100 without cuda graph.
NGC-25.05 and 8 * H100 (no drop token):
- no cuda graph: TFLOPS: 115.5, mem: 56.587GB
- Turning on --te-rng-tracker, --cuda-graph-scope attn, --external-cuda-graph, io_memory_reduction=True: runs normally. TFLOPS: 132.4, mem: 57.333GB
NGC-24.08 and 8 * H100 (no drop token):
- no cuda graph: TFLOPS: 155.7, mem: 56.057GB
- Turning on --te-rng-tracker, --cuda-graph-scope attn, --external-cuda-graph, io_memory_reduction=True: the following error occurred.
MODEL_ARGS=(
--use-mcore-models
--transformer-impl "transformer_engine"
--disable-bias-linear
--seq-length 4096
--max-position-embeddings 4096
--ffn-hidden-size 4096 #4h
--num-layers 72
--hidden-size 1024
--num-attention-heads 64
--group-query-attention
--num-query-groups 8
--init-method-std 0.008
--attention-dropout 0.0
--hidden-dropout 0.0
--normalization RMSNorm
--norm-epsilon 1e-5
--untie-embeddings-and-output-weights
--position-embedding-type rope
--rotary-percent 1.0
--swiglu
--no-masked-softmax-fusion
--no-position-embedding
--use-flash-attn
--overlap-grad-reduce
--overlap-param-gather
--ckpt-format torch_dist
--te-rng-tracker
--cuda-graph-scope attn
--external-cuda-graph
)
MOE_ARGS=(
--num-experts 8
--expert-model-parallel-size 1
--expert-tensor-parallel-size 1
--moe-router-load-balancing-type aux_loss # options: aux_loss, sinkhorn, None. Default is aux_loss.
--moe-router-topk 4
--moe-grouped-gemm
--moe-aux-loss-coeff 1e-2
--moe-z-loss-coeff 1e-3
--moe-token-dispatcher-type alltoall
)
DATA_ARGS=(
--data-path $DATA_PATH
--split 100,0,0
--tokenizer-type Llama2Tokenizer
--tokenizer-model ${TOKENIZER_MODEL_PATH}
--data-cache-path $DATA_CACHE_PATH
)
TRAINING_ARGS=(
--micro-batch-size 1
--global-batch-size 128
--lr 2.6e-4
--train-iters 1000
--lr-decay-iters 1000
--lr-decay-style cosine
--min-lr 2.6e-5
--weight-decay 0.1
--lr-warmup-iters 200
--clip-grad 1.0
--bf16
--adam-beta1 0.9
--adam-beta2 0.95
--adam-eps 1e-8
)
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 8
--use-distributed-optimizer
--sequence-parallel
)
What's the throughput once the MoE balance loss is low enough? If you only compare throughput at the very first steps, the numbers may be misleading.
For the same strategy, even in the initial stage, will there be a big difference if different NGC images are used?
There is a problem with the NGC-24.08 image using cuda graph, which needs to be resolved.
I made some tests and found that this problem exists in <=24.09; it runs normally in 24.10. So I suspect a PyTorch commit between 24.09 and 24.10 added this support. If you'd like to stick to 24.08, you'll have to find that commit and cherry-pick it. I counted 1321 commits between 24.09 and 24.10, so bisection can find the target in about 10-11 attempts.
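For reference, the attempt count is just binary search over the commit range; a tiny illustrative check:

```python
# Binary search over N candidate commits needs about ceil(log2(N)) test builds
# in the worst case; for the 1321 commits between the 24.09 and 24.10 containers:
import math

print(math.ceil(math.log2(1321)))  # -> 11, i.e. roughly 10-11 attempts
```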
Oh, an easier way is to remove the two make_weak_ref calls in graph.py, e.g. replacing per_callable_static_outputs[per_callable_bwd_idx] = make_weak_ref(static_outputs) with per_callable_static_outputs[per_callable_bwd_idx] = static_outputs. This lets you completely avoid the problem path, but it also leads to extra memory usage.
OK, thanks.
Closing this as the issue seems to be resolved.