buptzyb
This PR is part of the Multi-Stream feature in TF, which is proposed in #61185. It allows merging the DtoH/HtoD/DtoD copy streams into the compute stream in one...
# Description TE allows passing multiple callables into `make_graphed_callables()` so that they share a cudagraph pool and save memory. However, each cudagraph has its own input and output data buffers. This causes...
# What does this PR do? Main PR #2392. Reuses the TE cudagraph static input tensor memory buffer among microbatches. This doesn't reduce memory when cudagraph is running up. But...
# What does this PR do? Main PR #1920. Part 1 is #1917. This part mainly changes the cuda_graph_scope to an enum structure, and fixes cudagraph UTs. :warning: For major...
Dev branch PRs #1917 & #2353. With this PR, `--cuda-graph-scope` in `--cuda-graph-impl=transformer_engine` mode now supports combinations of the six values: 1. `attn`: captures operations in TransformerLayer._forward_attention(). 2. `mlp`: captures...
# What does this PR do? Dev PR #2391. Reuses the TE cudagraph static input tensor memory buffer among microbatches. This doesn't reduce memory when cudagraph is running up. But...