
feat(MoE): Refactor cuda_graph_scope

Open · buptzyb opened this pull request 2 months ago · 21 comments

Dev branch PRs: #1917 & #2353.

With this PR, --cuda-graph-scope in --cuda-graph-impl=transformer_engine mode now supports combinations of the six values:

  1. attn: captures operations in TransformerLayer._forward_attention().
  2. mlp: captures operations in TransformerLayer._forward_mlp() for a dense layer.
  3. moe: captures operations in TransformerLayer._forward_mlp() for a MoE layer.
  4. moe_router: captures operations in TransformerLayer._forward_mlp() up to MoELayer.router(). Note that if shared-experts overlap is not used, this scope also captures the shared experts.
  5. moe_preprocess: captures operations in MoELayer.preprocess(). Must be used together with moe_router (see the sketch after this list).
  6. mamba: captures the mamba layer.
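
For illustration, here is a minimal sketch of how such a multi-value flag and the moe_preprocess/moe_router constraint could be expressed. This is not the actual Megatron-LM argument parsing; the parser and the validate_scope helper below are assumptions.

```python
# Hypothetical sketch only -- the scope names mirror the list above, but the
# parser and validation helper are illustrative, not Megatron-LM's code.
import argparse

VALID_SCOPES = ("attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba")

def parse_scope(argv=None):
    parser = argparse.ArgumentParser()
    # nargs="*" lets users combine several values, e.g. "attn mlp";
    # the default empty list means the whole layer is captured.
    parser.add_argument("--cuda-graph-scope", nargs="*", default=[],
                        choices=VALID_SCOPES)
    return parser.parse_args(argv).cuda_graph_scope

def validate_scope(scope):
    # Rule from item 5: moe_preprocess must be used together with moe_router.
    if "moe_preprocess" in scope and "moe_router" not in scope:
        raise ValueError("moe_preprocess requires moe_router")
    return scope

scope = parse_scope(["--cuda-graph-scope", "attn", "moe_router", "moe_preprocess"])
print(validate_scope(scope))  # ['attn', 'moe_router', 'moe_preprocess']
```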
  • Example 1:

For a dense model, set --cuda-graph-scope attn mlp to capture the whole Transformer layer, --cuda-graph-scope attn to capture only the attention part, or --cuda-graph-scope mlp to capture only the mlp part. The non-graphed part falls back to the normal forward pass.
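
As a toy illustration of this dispatch (not Megatron-LM's TransformerLayer code; the function below is an assumption for clarity), the selected scopes determine which part of a dense layer falls inside the captured region, with the rest running the normal pass:

```python
# Toy sketch (not Megatron-LM code): which part of a dense TransformerLayer
# would be captured for each --cuda-graph-scope setting from Example 1.
def dense_graphed_region(scope):
    scope = set(scope)
    if not scope or scope >= {"attn", "mlp"}:
        return "attn+mlp"   # whole layer captured (empty scope = full layer)
    if "attn" in scope:
        return "attn"       # attention captured, mlp runs the normal pass
    if "mlp" in scope:
        return "mlp"        # mlp captured, attention runs the normal pass
    return None             # nothing captured for this dense layer

for s in ([], ["attn", "mlp"], ["attn"], ["mlp"]):
    print(s, "->", dense_graphed_region(s))
```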

  • Example 2:

For a MoE model, set --cuda-graph-scope attn moe_router moe_preprocess to capture operations from the beginning of the Transformer layer through the preprocess method of the MoE token dispatcher. However, if you are using the alltoall dispatcher with the drop-no-padding or router-padding-for-fp8 options, you can only set --cuda-graph-scope attn moe_router to capture up to the MoE router, because the preprocess method contains a CUDA synchronization. Finally, if you are using the alltoall dispatcher with drop-padding, you can directly set --cuda-graph-scope attn moe to capture the whole layer, since it is sync-free.
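
A hedged sketch of this decision rule (illustrative only; the helper and its boolean parameters are hypothetical stand-ins for the dispatcher options named above, not flags from the PR):

```python
# Illustrative helper only (not part of the PR): chooses a --cuda-graph-scope
# for a MoE layer following the constraints in Example 2.
def choose_moe_scope(dispatcher, drop_with_padding=False,
                     drop_no_padding=False, router_padding_for_fp8=False):
    if dispatcher == "alltoall" and (drop_no_padding or router_padding_for_fp8):
        # preprocess() performs a CUDA synchronization -> stop at the router.
        return ["attn", "moe_router"]
    if dispatcher == "alltoall" and drop_with_padding:
        # Drop-and-pad is sync-free, so the whole MoE layer can be captured.
        return ["attn", "moe"]
    # Otherwise, capture from the start of the layer up to and including preprocess().
    return ["attn", "moe_router", "moe_preprocess"]

print(choose_moe_scope("alltoall", drop_no_padding=True))    # ['attn', 'moe_router']
print(choose_moe_scope("alltoall", drop_with_padding=True))  # ['attn', 'moe']
print(choose_moe_scope("alltoall"))                          # ['attn', 'moe_router', 'moe_preprocess']
```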

  • Example 3:

For a model that contains both dense and MoE layers, such as DeepSeek, the dense layers will check for "mlp" and the MoE layers will check for the "moe*" values. For example, setting --cuda-graph-scope attn mlp moe_router moe_preprocess captures the whole (attn+mlp) dense layer and the attn+router+preprocess part of the MoE layer.

  • Example 4:

A Mamba model is different from a traditional TransformerLayer-based model: the mamba, attention, mlp, and moe components live in separate layer objects. For a mamba+moe model, you can set --cuda-graph-scope mamba attn moe_router to capture the corresponding layers, or set --cuda-graph-scope attn moe_router if you don't want the mamba layers to be graphed.
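
To tie Examples 3 and 4 together, here is an illustrative mapping (not Megatron-LM code; the table and helper are assumptions) of which scope values each layer type consults:

```python
# Illustrative only: a dense layer checks "mlp", a MoE layer checks the "moe*"
# values, and a Mamba layer checks "mamba"; "attn" applies to dense and MoE layers.
LAYER_SCOPE_KEYS = {
    "dense": {"attn", "mlp"},
    "moe":   {"attn", "moe", "moe_router", "moe_preprocess"},
    "mamba": {"mamba"},
}

def relevant_scope(layer_type, cuda_graph_scope):
    """Return the subset of the requested scope that applies to this layer type."""
    return sorted(set(cuda_graph_scope) & LAYER_SCOPE_KEYS[layer_type])

scope = ["attn", "mlp", "moe_router", "moe_preprocess"]   # Example 3 setting
print(relevant_scope("dense", scope))  # ['attn', 'mlp'] -> whole dense layer graphed
print(relevant_scope("moe", scope))    # ['attn', 'moe_preprocess', 'moe_router']
print(relevant_scope("mamba", scope))  # []              -> a mamba layer would not be graphed
```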


buptzyb avatar Oct 24 '25 07:10 buptzyb

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Oct 24 '25 07:10 copy-pr-bot[bot]

/ok to test 3b85592a59cc24785baa41ae8d343fec87311344

buptzyb avatar Oct 27 '25 10:10 buptzyb

/ok to test 97e09403b38049f2007fed74e8f168dd6d8e984a

buptzyb avatar Oct 28 '25 03:10 buptzyb

/ok to test 770b510b59662610321525996ebbda4c1f12af41

buptzyb avatar Nov 03 '25 08:11 buptzyb

/ok to test ebb0e9d9b74a698abdcf7d99d01721fb859383ba

buptzyb avatar Nov 05 '25 07:11 buptzyb

@jiemingz @mathemakitten @sidsingh-nvidia @lmcafee-nvidia Can you please take a look at this MR?

kvareddy avatar Nov 05 '25 15:11 kvareddy

/ok to test c07462349a86e33af9211ccb44a13ba832683be5

buptzyb avatar Nov 10 '25 10:11 buptzyb

@jiemingz @mathemakitten @sidsingh-nvidia can you please take a look at this MR?

kvareddy avatar Nov 13 '25 10:11 kvareddy

/ok to test a35489f3ae2670f5b10dda306ead45b4cf814488

buptzyb avatar Nov 17 '25 12:11 buptzyb

@sidsingh-nvidia @lmcafee-nvidia @santhnm2 can you please take a look at this MR?

kvareddy avatar Nov 17 '25 17:11 kvareddy

All my outstanding questions/comments re: cudagraphs are resolved. I will defer formal approval on behalf of @NVIDIA/inference in case anyone else wants to weigh in before merge.

mathemakitten avatar Nov 18 '25 23:11 mathemakitten

> All my outstanding questions/comments re: cudagraphs are resolved. I will defer formal approval on behalf of @NVIDIA/inference in case anyone else wants to weigh in before merge.

Thanks! @kvareddy @sidsingh-nvidia @lmcafee-nvidia @santhnm2 could you help review too?

buptzyb avatar Nov 19 '25 02:11 buptzyb

/ok to test 0337f2053bafd3affb50a56a0d01c8650141b98d

buptzyb avatar Nov 20 '25 05:11 buptzyb

Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of NVIDIA/hybrid-mamba? Thanks!

buptzyb avatar Nov 20 '25 09:11 buptzyb

If one wants to use per-layer cuda-graphs (--cuda-graph-scope full as of today in main), do we set --cuda-graph-scope as attn mlp? In that case, are we doubling the number of cuda graphs as compared to main?

sidsingh-nvidia avatar Nov 21 '25 21:11 sidsingh-nvidia

> If one wants to use per-layer cuda-graphs (--cuda-graph-scope full as of today in main), do we set --cuda-graph-scope as attn mlp?

One option is to set --cuda-graph-scope attn mlp as you said, if it's a dense model. A better way is to not set this argument at all, leaving it as the default empty list; this indicates that the whole layer is captured.

> In that case, are we doubling the number of cuda graphs as compared to main?

No, we assume that one layer contains only one cudagraph region. When you specify "attn mlp", the two parts are captured as a single graph. Another example: in a MoE model, if you specify "attn moe_router moe_preprocess", everything from attention through the MoE preprocessing is captured as one cudagraph. So specifying "attn moe_preprocess" would fail instead of capturing two separate graphs.
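
A hedged sketch of that contiguity rule (illustrative only; the ordering list and the check below are assumptions, not the PR's implementation):

```python
# Illustrative contiguity check (not the PR's code): the scopes selected for a
# layer must form one contiguous region, because each layer holds at most one
# cudagraph. "attn moe_router moe_preprocess" is contiguous; "attn moe_preprocess"
# skips moe_router and would be rejected rather than split into two graphs.
MOE_LAYER_ORDER = ["attn", "moe_router", "moe_preprocess"]

def is_contiguous(scope, order=MOE_LAYER_ORDER):
    picked = [name in scope for name in order]
    if not any(picked):
        return True  # nothing graphed in this layer
    first = picked.index(True)
    last = len(picked) - 1 - picked[::-1].index(True)
    return all(picked[first:last + 1])

print(is_contiguous({"attn", "moe_router", "moe_preprocess"}))  # True  -> one graph
print(is_contiguous({"attn", "moe_preprocess"}))                # False -> rejected
```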

buptzyb avatar Nov 22 '25 03:11 buptzyb

Thanks for the clarification. Is this behavior present for both the local and TE implementation or just for TE? Mcore inference solely uses the local implementation, hence my question.

siddharth9820 avatar Nov 22 '25 05:11 siddharth9820

> Thanks for the clarification. Is this behavior present for both the local and TE implementation or just for TE? Mcore inference solely uses the local implementation, hence my question.

If you just leave cuda-graph-scope as an empty list, the two implementations work the same: the whole layer is captured as one graph. If you set values for cuda-graph-scope, the only valid value for the local implementation is "full_iteration"; other values are forbidden (here), since this PR only supports partial cudagraph capture for the TE implementation.
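
A hedged sketch of that validation (illustrative only; this is not the actual check linked above):

```python
# Illustrative validation only (not the linked Megatron-LM check): with the
# local implementation, cuda-graph-scope may be empty (whole layer) or exactly
# ["full_iteration"]; partial scopes are only supported by the TE implementation.
def check_scope(cuda_graph_impl, cuda_graph_scope):
    if cuda_graph_impl == "local" and cuda_graph_scope not in ([], ["full_iteration"]):
        raise ValueError(
            "partial cuda-graph-scope values require --cuda-graph-impl=transformer_engine"
        )

check_scope("local", [])                                    # ok: whole layer captured
check_scope("local", ["full_iteration"])                    # ok
check_scope("transformer_engine", ["attn", "moe_router"])   # ok: partial scope with TE
```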

By the way, I know Jimmy has some ongoing work to support partial cudagraph for the local implementation as well; that is in the adlr local branch.

buptzyb avatar Nov 23 '25 12:11 buptzyb

Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of NVIDIA/hybrid-mamba? @kvareddy @santhnm2 could you help review? Thanks!

buptzyb avatar Nov 25 '25 01:11 buptzyb

/ok to test e82523232f24385d44b3ca656d8d297ba866cb07

buptzyb avatar Nov 26 '25 07:11 buptzyb

/ok to test 40a89ce0869a5a274e58261a22faaf690d09d19a

buptzyb avatar Nov 26 '25 08:11 buptzyb

/ok to test 9639ab97ce63782790636afe88b83ac39cd3891d

buptzyb avatar Dec 01 '25 12:12 buptzyb

/ok to test 0823434996ce1f0681571a8247824487dd5e3f06

buptzyb avatar Dec 02 '25 11:12 buptzyb

Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of NVIDIA/hybrid-mamba? @kvareddy @santhnm2 could you help review? Thanks!

buptzyb avatar Dec 02 '25 13:12 buptzyb

@lmcafee-nvidia @mathemakitten can you please sign off on this MR?

kvareddy avatar Dec 04 '25 05:12 kvareddy

Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of hybrid-mamba? @jaredcasper @deepakn94 @santhnm2 @ericharper could you help review on behalf of gpt? Thanks!

buptzyb avatar Dec 07 '25 04:12 buptzyb