buptzyb issues

Results: 6 issues of buptzyb

Allow merging compute-copy streams

This PR is part of the Multi-Stream feature in TF proposed in #61185. Allow merging the DtoH/HtoD/DtoD copy streams into the compute stream in one...

size:M

comp:core

Save CUDA Graph memory by reusing input and output tensors

# Description TE allows passing multiple callables into `make_graphed_callables()` so that they share a cudagraph pool and save memory. However, each cudagraph has its own input and output data buffers. This causes...

[Dev] Optimize TE cudagraph input memory

# What does this PR do? Main PR: #2392. Reuse the TE cudagraph static input tensor memory buffer among microbatches. This doesn't reduce memory when cudagraph is running up. But...

Expert Review

dev branch

[Dev] feat(MoE): Refactor cuda_graph_scope - part2

# What does this PR do? Main PR: #1920. Part 1 is #1917. This part mainly changes the cuda_graph_scope to an enum structure and fixes the cudagraph UTs. :warning: For major...

Expert Review

dev branch

feat(MoE): Refactor cuda_graph_scope

Dev branch PRs: #1917 & #2353. With this PR, `--cuda-graph-scope` in `--cuda-graph-impl=transformer_engine` mode now supports combinations of the six values: 1. `attn`: captures operations in TransformerLayer._forward_attention(). 2. `mlp`: captures...

module: moe

Expert Review

core_r0.15.0

Optimize TE cudagraph input memory

# What does this PR do? Dev PR: #2391. Reuse the TE cudagraph static input tensor memory buffer among microbatches. This doesn't reduce memory when cudagraph is running up. But...