Megatron-DeepSpeed
Check that fused kernels are used to compute scaled masked softmax on prefix LM
- Related to: #209
Re-opening the PR, as it seems to pass locally but fail in CI.
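For context, here is a minimal, unfused reference for what "scaled masked softmax on prefix LM" computes. It is a sketch and not the fused kernel or the test from this PR. The helper names `prefix_lm_mask` and `scaled_masked_softmax` are hypothetical, and the mask convention (`True` means the position is masked out) is an assumption.

```python
import math


def prefix_lm_mask(seq_len, prefix_len):
    """Build a prefix-LM mask (hypothetical helper).

    mask[i][j] is True when query i must NOT attend to key j.
    Keys inside the prefix are visible to every query; keys after
    the prefix are only visible causally (j <= i).
    """
    return [
        [j > i and j >= prefix_len for j in range(seq_len)]
        for i in range(seq_len)
    ]


def scaled_masked_softmax(scores, mask, scale):
    """Unfused reference: row-wise softmax(scale * scores), masked entries get 0."""
    out = []
    for row, mask_row in zip(scores, mask):
        # Masked positions are set to -inf so they contribute exp(-inf) = 0.
        scaled = [-math.inf if m else scale * s for s, m in zip(row, mask_row)]
        row_max = max(scaled)  # subtract the row max for numerical stability
        exps = [math.exp(x - row_max) for x in scaled]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# Example: 4 tokens, the first 2 form the bidirectional prefix.
mask = prefix_lm_mask(seq_len=4, prefix_len=2)
scores = [[0.0] * 4 for _ in range(4)]
probs = scaled_masked_softmax(scores, mask, scale=0.5)
```

A check of the fused path could compare its output against a reference like this within a numerical tolerance, but how the PR actually performs the comparison is not stated here.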