Megatron-LM
Speed up the creation of attention mask
Prefer the in-place variants `triu_`/`tril_`: since torch 2.3.0 they are faster than the out-of-place `triu`/`tril` (https://github.com/pytorch/pytorch/pull/115013).
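A minimal sketch of the idea, assuming a Megatron-style causal mask of shape `[1, 1, seq, seq]` where `True` marks positions to be masked out; `build_causal_mask` is a hypothetical helper name, not an actual Megatron-LM function:

```python
import torch


def build_causal_mask(seq_length: int, device=None) -> torch.Tensor:
    # Hypothetical helper illustrating the change: allocate once, then use
    # the in-place tril_() instead of the out-of-place torch.tril(), which
    # is faster since torch 2.3.0 (pytorch/pytorch#115013).
    mask = torch.ones((1, 1, seq_length, seq_length),
                      dtype=torch.bool, device=device)
    mask.tril_()  # in-place: zero out entries above the diagonal
    # Invert so that True marks the (future) positions to mask out.
    return ~mask
```

The in-place call avoids allocating a second `[seq, seq]` buffer, which matters when the mask is rebuilt per batch at long sequence lengths.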
Generally, the mask will be created inside Transformer Engine when `--use-mcore-models` is set.
Marking as stale. No activity in 60 days.