Xiaowei Ren

Results: 5 issues by Xiaowei Ren

With the latest implementation of Latency Hiding Scheduling, we observe that most of the weight-gradient all-reduce latency is still exposed [(see slides 6 and 7 here)](https://docs.google.com/presentation/d/1s2B4DPuhOVQbJ4SAZA7XWBKL5ST-Dfcn/edit#slide=id.g1895a52e93e_0_0). Here is a brief...
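
A minimal sketch of the setting the issue describes, assuming a JAX data-parallel training step where the weight-gradient all-reduce should overlap with backward compute; the flag name reflects the XLA GPU option for the latency hiding scheduler and may differ across versions. Whether the collective actually overlaps can then be checked in a profile or in the post-scheduling HLO dump.

```python
# Sketch only: toy data-parallel step to reproduce/inspect exposed
# gradient all-reduce latency under the latency hiding scheduler.
import os

# Set XLA_FLAGS before JAX initializes its backend so the option is picked up.
os.environ["XLA_FLAGS"] = (
    os.environ.get("XLA_FLAGS", "")
    + " --xla_gpu_enable_latency_hiding_scheduler=true"
)

import jax
import jax.numpy as jnp

def loss(w, x):
    return jnp.sum((x @ w) ** 2)

def step(w, x):
    g = jax.grad(loss)(w, x)
    # Cross-device weight-gradient all-reduce; ideally this is overlapped
    # with remaining backward work instead of sitting on the critical path.
    g = jax.lax.pmean(g, axis_name="dp")
    return w - 1e-3 * g

step_dp = jax.pmap(step, axis_name="dp")

n = jax.local_device_count()
w = jnp.zeros((n, 64, 64))
x = jnp.ones((n, 8, 64))
w = step_dp(w, x)
```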

By default, layout assignment tries to assign transposes a layout that makes them a bitcast. This layout is then propagated inside the HloComputation, which means that if it does not...
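
One way to observe this, as a hedged sketch: compile a small jitted function containing a transpose and inspect the optimized HLO text to see whether the transpose was lowered to a bitcast under the chosen layouts.

```python
# Sketch only: dump the compiled HLO of a jitted transpose and check whether
# layout assignment turned the transpose into a bitcast.
import jax
import jax.numpy as jnp

def f(x):
    # A transpose feeding a consumer; layout assignment may pick a layout that
    # lets the transpose become a bitcast, and that layout then propagates
    # through the rest of the computation.
    return jnp.transpose(x, (1, 0)) @ x

x = jnp.ones((256, 512), dtype=jnp.float32)
compiled = jax.jit(f).lower(x).compile()

# Look for 'bitcast' vs. a real 'transpose' instruction in the optimized HLO.
print(compiled.as_text())
```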

# What does this PR do? Remove unnecessary attention masks. [Related MCore MR is here.](https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/merge_requests/1259)

core
NLP
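
The snippet below is a minimal sketch of the general idea behind removing unnecessary attention masks (not the actual NeMo/MCore change): when the mask is implied by the attention type, e.g. causal, passing an explicit mask tensor is redundant and can be dropped.

```python
# Sketch only: an explicit causal mask vs. letting the kernel apply causal
# masking internally; both produce the same output, the latter skips the
# mask tensor entirely.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
b, h, s, d = 2, 8, 128, 64
q = torch.randn(b, h, s, d, device=device)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Redundant: materialize an explicit causal mask and pass it in.
causal_mask = torch.tril(torch.ones(s, s, dtype=torch.bool, device=device))
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_mask)

# Equivalent: causal masking handled inside the kernel, no mask tensor needed.
out_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

torch.testing.assert_close(out_masked, out_causal, rtol=1e-3, atol=1e-3)
```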

# Description This is a CP implementation variant with KV all-gather. Currently, it can support:
- sliding window attention + causal + FlashAttention
- full window attention + causal +...
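
A minimal sketch of the KV all-gather flavor of context parallelism, under the assumption that each CP rank holds one sequence chunk of Q/K/V; the function and group names are illustrative, and the real implementation fuses this with FlashAttention and handles causal/sliding-window masking across chunks.

```python
# Sketch only: all-gather K/V across the CP group, then each rank computes
# attention for its local Q chunk against the full-sequence K/V.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def cp_attention_kv_allgather(q, k, v, cp_group):
    """q, k, v: [batch, heads, local_seq, head_dim], one sequence chunk per rank."""
    cp_size = dist.get_world_size(cp_group)

    # All-gather the K/V chunks so every rank sees the full sequence.
    k_chunks = [torch.empty_like(k) for _ in range(cp_size)]
    v_chunks = [torch.empty_like(v) for _ in range(cp_size)]
    dist.all_gather(k_chunks, k.contiguous(), group=cp_group)
    dist.all_gather(v_chunks, v.contiguous(), group=cp_group)
    k_full = torch.cat(k_chunks, dim=2)
    v_full = torch.cat(v_chunks, dim=2)

    # Local Q attends over the full K/V; cross-chunk causal masking omitted
    # here for brevity.
    return F.scaled_dot_product_attention(q, k_full, v_full)
```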

# Description This PR adds a hierarchical implementation of context parallelism to attention. It uses A2A communications in low-level CP groups (e.g., via NVLink), and P2P communications in high-level CP...
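
A hedged sketch of just the hierarchical group setup this describes (not the A2A/P2P attention exchange itself): ranks within a node form low-level CP groups that use all-to-all over NVLink, and ranks at the same local position across nodes form high-level CP groups that use point-to-point exchanges. The function name, `intra_size`, and the node-major rank layout are assumptions.

```python
# Sketch only: build low-level (intra-node, A2A) and high-level (inter-node,
# P2P) CP process groups for the calling rank.
import torch.distributed as dist

def build_hierarchical_cp_groups(world_size, intra_size):
    """Return (low_level_group, high_level_group) for the calling rank."""
    rank = dist.get_rank()
    low_group, high_group = None, None

    # Low-level groups: consecutive ranks within a node, used for A2A.
    for start in range(0, world_size, intra_size):
        ranks = list(range(start, start + intra_size))
        g = dist.new_group(ranks)  # every rank must create groups in the same order
        if rank in ranks:
            low_group = g

    # High-level groups: same local position across nodes, used for P2P.
    for offset in range(intra_size):
        ranks = list(range(offset, world_size, intra_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            high_group = g

    return low_group, high_group
```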