buptzyb

Results 6 issues of buptzyb

This PR is part of the Multi-Stream feature in TF proposed in #61185. It allows merging the DtoH/HtoD/DtoD copy streams into the compute stream in one...

size:M
comp:core

# Description TE allows passing multiple callables into `make_graphed_callables()` so that they share a cudagraph pool and save memory. However, each cudagraph has its own input and output data buffers. This causes...

# What does this PR do ? main PR #2392 Reuse TE cudagraph static input tensor memory buffer among microbatches. This doesn't reduce memory when cudagraph is running up. But...

Expert Review
dev branch

# What does this PR do ? main PR #1920. part1 is #1917. This part mainly changes the cuda_graph_scope to an enum structure, and fixes cudagraph UTs. :warning: For major...

Expert Review
dev branch

dev branch PR #1917 & #2353. With this PR, `--cuda-graph-scope` in `--cuda-graph-impl=transformer_engine` mode now supports combinations of the six values: 1. `attn`: captures operations in TransformerLayer._forward_attention(). 2. `mlp`: captures...

module: moe
Expert Review
core_r0.15.0

# What does this PR do ? dev PR #2391 Reuse TE cudagraph static input tensor memory buffer among microbatches. This doesn't reduce memory when cudagraph is running up. But...