Megatron-LM
Why does Megatron's ParallelAttention currently only support SelfAttention with a 'causal' MaskType?
Could you please explain why Megatron's ParallelAttention currently only supports SelfAttention with a 'causal' MaskType? Also, is there potential for FlashAttention support in cases where the mask is 'None', for both Multihead Self Attention (MSA) and Multihead Cross Attention (MCA)?
In the ParallelAttention class, I noticed this:
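The check in question (paraphrased from the legacy `ParallelAttention.__init__` in `megatron/model/transformer.py`; the exact code may differ across versions) looks roughly like this:

```python
# Paraphrased sketch of the guard in the legacy ParallelAttention.__init__;
# the FlashAttention path is only enabled for causal self-attention.
self.use_flash_attn = args.use_flash_attn \
    and attention_type == AttnType.self_attn \
    and self.attn_mask_type == AttnMaskType.causal
if self.use_flash_attn:
    assert attention_type == AttnType.self_attn, \
        'FlashAttention code path only supports self-attention for now'
    assert self.attn_mask_type == AttnMaskType.causal, \
        'FlashAttention code path only supports causal mask for now'
```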
It appears that the use of flash-attention is currently confined to causal self-attention. However, given that flash-attention itself is compatible with padding mask types and with cross-attention, could you explain any technical challenges or distinctions that prevent its extension to these contexts?
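For reference, the flash-attn library's own public API accepts `causal=False` and differing query/key lengths, i.e. cross-attention. A minimal sketch (this assumes flash-attn 2.x on a CUDA device and is not Megatron's integration):

```python
import torch
from flash_attn import flash_attn_func

batch, heads, head_dim = 2, 8, 64
# Queries of length 128 attending over a "memory" of length 256 (cross-attention shapes).
q = torch.randn(batch, 128, heads, head_dim, device='cuda', dtype=torch.bfloat16)
k = torch.randn(batch, 256, heads, head_dim, device='cuda', dtype=torch.bfloat16)
v = torch.randn(batch, 256, heads, head_dim, device='cuda', dtype=torch.bfloat16)

# causal=False gives full (unmasked) attention; query and key/value lengths may differ.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=False)
print(out.shape)  # torch.Size([2, 128, 8, 64])
```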
Marking as stale. No activity in 60 days.
You can use MaskType None when --use-mcore-models is set.