Megatron-LM
Why does Megatron's ParallelAttention currently only support SelfAttention with a 'causal' MaskType?
Could you please explain why Megatron's ParallelAttention currently only supports SelfAttention with a 'causal' MaskType? Also, is there potential for FlashAttention support in cases where the mask is 'None', for both Multihead Self Attention (MSA) and Multihead Cross Attention (MCA)?
In the ParallelAttention class, I noticed this:
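The check in question (paraphrased from the legacy `ParallelAttention.__init__` in `megatron/model/transformer.py`; the exact code may differ across versions) looks roughly like this:

```python
# Paraphrased sketch of the guard in the legacy ParallelAttention.__init__;
# the FlashAttention path is only enabled for causal self-attention.
self.use_flash_attn = args.use_flash_attn \
    and attention_type == AttnType.self_attn \
    and self.attn_mask_type == AttnMaskType.causal
if self.use_flash_attn:
    assert attention_type == AttnType.self_attn, \
        'FlashAttention code path only supports self-attention for now'
    assert self.attn_mask_type == AttnMaskType.causal, \
        'FlashAttention code path only supports causal mask for now'
```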
It appears that the use of flash-attention is currently confined to causal self-attention. However, given that flash-attention itself is compatible with padding mask types and with cross-attention, could you explain any technical challenges or distinctions that prevent its extension to these contexts?
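For reference, the flash-attn library's own public API accepts `causal=False` and differing query/key lengths, i.e. cross-attention. A minimal sketch (this assumes flash-attn 2.x on a CUDA device and is not Megatron's integration):

```python
import torch
from flash_attn import flash_attn_func

batch, heads, head_dim = 2, 8, 64
# Queries of length 128 attending over a "memory" of length 256 (cross-attention shapes).
q = torch.randn(batch, 128, heads, head_dim, device='cuda', dtype=torch.bfloat16)
k = torch.randn(batch, 256, heads, head_dim, device='cuda', dtype=torch.bfloat16)
v = torch.randn(batch, 256, heads, head_dim, device='cuda', dtype=torch.bfloat16)

# causal=False gives full (unmasked) attention; query and key/value lengths may differ.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=False)
print(out.shape)  # torch.Size([2, 128, 8, 64])
```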
Marking as stale. No activity in 60 days.
You can use MaskType None when --use-mcore-models is set.