guyueh1

Results 20 issues of guyueh1

# What does this PR do ? **Make it an option to use thunder to jit-compile the dropout in LoRA adapters; thunder can reduce the memory footprint of this layer...

NLP

# Description **When using `FP8_DPA=1 NVTE_FP8_DPA_BWD=0`, the backprop uses BF16 q/k/v/out tensors and the fp8 q/k/v/o are not used. So we should avoid saving them for backprop, which reduces the...

> [!IMPORTANT] > The `Update branch` button must only be pressed on very rare occasions. > An outdated branch never blocks the merge of a PR. > Please reach...

NLP
Run CICD

# What does this PR do ? This is a follow-up to #1569 that fixes the sequence length for the PP>1 case. # Issues List issues that this PR closes...

CI:L2

# What does this PR do ? Support virtual pipeline parallelism (VPP) in mcore # Issues closes #1038 # Usage * **You can potentially add a usage example below** ```python...

CI:L0