feat(MoE): Refactor cuda_graph_scope
Dev branch PRs: #1917 & #2353.
With this PR, --cuda-graph-scope in --cuda-graph-impl=transformer_engine mode now supports combinations of the six values:
- attn: captures operations in TransformerLayer._forward_attention().
- mlp: captures operations in TransformerLayer._forward_mlp() for a dense layer.
- moe: captures operations in TransformerLayer._forward_mlp() for a MoE layer.
- moe_router: captures operations in TransformerLayer._forward_mlp() up to MoELayer.router(). Note that if shared experts overlap is not used, it also captures the shared experts.
- moe_preprocess: captures operations in MoELayer.preprocess(). Must be used together with moe_router.
- mamba: captures the Mamba layer.
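A minimal sketch of how these combinations might be validated (the helper name validate_cuda_graph_scope below is hypothetical, not an actual function in this PR):

```python
# Hypothetical sketch only: validates a --cuda-graph-scope list against the
# rules described above; the real argument handling in the PR may differ.
VALID_SCOPES = {"attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba"}

def validate_cuda_graph_scope(scope):
    """`scope` is the list passed to --cuda-graph-scope (empty = capture the whole layer)."""
    unknown = set(scope) - VALID_SCOPES
    if unknown:
        raise ValueError(f"Unknown --cuda-graph-scope values: {sorted(unknown)}")
    # moe_preprocess is only meaningful together with moe_router.
    if "moe_preprocess" in scope and "moe_router" not in scope:
        raise ValueError("moe_preprocess must be used together with moe_router")
    return scope

validate_cuda_graph_scope(["attn", "moe_router", "moe_preprocess"])  # OK
```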
- Example 1:
For a dense model, set --cuda-graph-scope attn mlp to capture the whole Transformer layer, set --cuda-graph-scope attn to capture only the attention part, or set --cuda-graph-scope mlp to capture only the MLP part. The non-graphed parts fall back to the normal forward pass.
- Example 2:
For a MoE model, set --cuda-graph-scope attn moe_router moe_preprocess to capture operations from the beginning of the Transformer layer up to the preprocess method of the MoE token dispatcher. However, if you are using the alltoall dispatcher with the drop-no-padding or router-padding-for-fp8 options, you can only set --cuda-graph-scope attn moe_router to capture up to the MoE router, because there is CUDA synchronization in the preprocess method. Finally, if you are using the alltoall dispatcher with drop-padding, you can directly set --cuda-graph-scope attn moe to capture the whole layer, since it is sync-free.
- Example 3:
For a model that contains both dense and MoE layers, like DeepSeek, the dense layers check for "mlp" and the MoE layers check for the "moe*" values (see the sketch after Example 4). For example, setting --cuda-graph-scope attn mlp moe_router moe_preprocess captures the whole (attn+mlp) dense layer, and the attn+router+preprocess part of each MoE layer.
- Example 4:
A Mamba model is different from a traditional TransformerLayer-based model: the mamba, attention, mlp, and moe modules all live in different layer objects. For a mamba+moe model, you can set --cuda-graph-scope mamba attn moe_router to capture the corresponding layers, or set --cuda-graph-scope attn moe_router if you don't want the Mamba layers to be graphed.
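To make Examples 3 and 4 concrete, here is an illustrative sketch of how each layer type might pick out the scope values it reacts to. The helper scopes_for_layer and the string layer tags are hypothetical; the real dispatch lives inside the layer classes.

```python
# Illustrative only: per-layer-type lookup of the --cuda-graph-scope values a
# layer responds to. Dense layers check "mlp", MoE layers check "moe*",
# and Mamba layers check "mamba".
def scopes_for_layer(layer_type, scope):
    if layer_type == "dense":    # TransformerLayer with a dense MLP
        wanted = {"attn", "mlp"}
    elif layer_type == "moe":    # TransformerLayer with a MoE MLP
        wanted = {"attn", "moe", "moe_router", "moe_preprocess"}
    elif layer_type == "mamba":  # standalone Mamba layer
        wanted = {"mamba"}
    else:
        raise ValueError(f"unknown layer type: {layer_type}")
    return [s for s in scope if s in wanted]

scope = ["attn", "mlp", "moe_router", "moe_preprocess"]
print(scopes_for_layer("dense", scope))  # ['attn', 'mlp'] -> whole dense layer graphed
print(scopes_for_layer("moe", scope))    # ['attn', 'moe_router', 'moe_preprocess']
print(scopes_for_layer("mamba", scope))  # [] -> Mamba layer not graphed
```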
/ok to test 3b85592a59cc24785baa41ae8d343fec87311344
/ok to test 97e09403b38049f2007fed74e8f168dd6d8e984a
/ok to test 770b510b59662610321525996ebbda4c1f12af41
/ok to test ebb0e9d9b74a698abdcf7d99d01721fb859383ba
@jiemingz @mathemakitten @sidsingh-nvidia @lmcafee-nvidia Can you please take a look at this MR?
/ok to test c07462349a86e33af9211ccb44a13ba832683be5
@jiemingz @mathemakitten @sidsingh-nvidia can you please take a look at this MR?
/ok to test a35489f3ae2670f5b10dda306ead45b4cf814488
@sidsingh-nvidia @lmcafee-nvidia @santhnm2 can you please take a look at this MR?
All my outstanding questions/comments re: cudagraphs are resolved. I will defer formal approval on behalf of @NVIDIA/inference in case anyone else wants to weigh in before merge.
Thanks! @kvareddy @sidsingh-nvidia @lmcafee-nvidia @santhnm2 could you help review too?
/ok to test 0337f2053bafd3affb50a56a0d01c8650141b98d
Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of NVIDIA/hybrid-mamba? Thanks!
If one wants to use per-layer cuda-graphs (--cuda-graph-scope full as of today in main), do we set --cuda-graph-scope as attn mlp? In that case, are we doubling the number of cuda graphs as compared to main?
> If one wants to use per-layer cuda-graphs (--cuda-graph-scope full as of today in main), do we set --cuda-graph-scope as attn mlp?
One option is to set --cuda-graph-scope attn mlp as you said, if it's a dense model. A better way is to not set this argument at all and leave it as the default empty list, which means the whole layer is captured.
> In that case, are we doubling the number of cuda graphs as compared to main?
No. We assume that one layer contains only one cudagraph region. When you specify "attn mlp", they are captured as a whole. Another example: in a MoE model, if you specify "attn moe_router moe_preprocess", it captures everything from attention to the MoE preprocessing part as a single cudagraph. So specifying "attn moe_preprocess" would fail instead of capturing two separate graphs.
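For illustration, the one-contiguous-region-per-layer assumption could be checked along these lines (the function name, the forward-order tuple, and the check location are all hypothetical):

```python
# Hypothetical sketch of the "one cuda-graph region per layer" rule described
# above; the PR's actual check may be implemented differently.
def is_contiguous_scope(scope, layer_order=("attn", "moe_router", "moe_preprocess")):
    """Selected scope values must form one contiguous block in forward order."""
    mask = [name in scope for name in layer_order]
    if not any(mask):
        return True  # empty scope: the whole layer is captured as one graph
    first = mask.index(True)
    last = len(mask) - 1 - mask[::-1].index(True)
    return all(mask[first:last + 1])

print(is_contiguous_scope(["attn", "moe_router", "moe_preprocess"]))  # True
print(is_contiguous_scope(["attn", "moe_preprocess"]))                # False: gap at moe_router
```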
Thanks for the clarification. Is this behavior present for both the local and TE implementation or just for TE? Mcore inference solely uses the local implementation, hence my question.
If you just leave cuda-graph-scope as an empty list, the two implementations behave the same: the whole layer is captured as one graph. If you set values for cuda-graph-scope, the only valid value for the local implementation is "full_iteration"; other values are forbidden, since this PR only supports partial cudagraph for the TE implementation.
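A rough sketch of that constraint (the function name and where it would be called are hypothetical):

```python
# Hypothetical sketch of the constraint described above: with the local
# implementation, a non-empty scope may only be ["full_iteration"].
def check_scope_for_impl(cuda_graph_impl, cuda_graph_scope):
    if cuda_graph_impl == "local" and cuda_graph_scope:
        assert cuda_graph_scope == ["full_iteration"], (
            "Partial cuda-graph scopes are only supported with "
            "--cuda-graph-impl=transformer_engine"
        )

check_scope_for_impl("local", [])                  # OK: whole layer captured
check_scope_for_impl("local", ["full_iteration"])  # OK
check_scope_for_impl("transformer_engine", ["attn", "moe_router"])  # OK
```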
BTW, I know Jimmy has some efforts to support partial cudagraph for the local implementation as well; that work is in the adlr local branch.
Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of NVIDIA/hybrid-mamba? @kvareddy @santhnm2 could you help review? Thanks!
/ok to test e82523232f24385d44b3ca656d8d297ba866cb07
/ok to test 40a89ce0869a5a274e58261a22faaf690d09d19a
/ok to test 9639ab97ce63782790636afe88b83ac39cd3891d
/ok to test 0823434996ce1f0681571a8247824487dd5e3f06
Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of NVIDIA/hybrid-mamba? @kvareddy @santhnm2 could you help review? Thanks!
@lmcafee-nvidia @mathemakitten can you please sign off on this MR?
Hi @rogerwaleffe @duncanriach @JRD971000 could you help review on behalf of hybrid-mamba? @jaredcasper @deepakn94 @santhnm2 @ericharper could you help review on behalf of gpt? Thanks!