NATTEN
Attention merging backward
The backward pass for attention merging needs to be handled manually: each KV branch's attention backward produces its own dQ, and the dQs from the different branches are simply added together elementwise to form the gradient with respect to the shared query.
See https://github.com/Dao-AILab/flash-attention/issues/1137
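Below is a minimal PyTorch sketch of why this works; it is not NATTEN's or flash-attention's API, and the tensor shapes plus the `attn_with_lse`/`merge` helpers are assumptions for illustration. The merged output depends on each branch only through that branch's output and log-sum-exp, so the chain rule splits dQ into one contribution per branch, and the total dQ is their elementwise sum.

```python
import torch

torch.manual_seed(0)
B, H, N, D = 1, 2, 8, 16
q = torch.randn(B, H, N, D, requires_grad=True)
k1, v1 = torch.randn(B, H, N, D), torch.randn(B, H, N, D)
k2, v2 = torch.randn(B, H, N, D), torch.randn(B, H, N, D)

def attn_with_lse(q, k, v):
    # Softmax attention that also returns the row-wise log-sum-exp,
    # which is what the merge needs.
    s = (q @ k.transpose(-2, -1)) * D ** -0.5
    lse = torch.logsumexp(s, dim=-1, keepdim=True)
    return (s - lse).exp() @ v, lse

def merge(o1, lse1, o2, lse2):
    # LSE-weighted merge of two attention outputs over disjoint KV branches.
    lse = torch.logaddexp(lse1, lse2)
    return (lse1 - lse).exp() * o1 + (lse2 - lse).exp() * o2

o1, lse1 = attn_with_lse(q, k1, v1)
o2, lse2 = attn_with_lse(q, k2, v2)
d_out = torch.randn(B, H, N, D)

# Reference: dQ of the merged output w.r.t. the shared query.
dq, = torch.autograd.grad(merge(o1, lse1, o2, lse2), q, d_out,
                          retain_graph=True)

# Branch-wise dQ: the gradient that flows back through one branch only
# (the other branch's output and LSE treated as constants), standing in
# for calling that branch's attention backward kernel by hand.
dq1, = torch.autograd.grad(merge(o1, lse1, o2.detach(), lse2.detach()),
                           q, d_out, retain_graph=True)
dq2, = torch.autograd.grad(merge(o1.detach(), lse1.detach(), o2, lse2),
                           q, d_out)

# The per-branch dQs sum elementwise to the full dQ.
print(torch.allclose(dq1 + dq2, dq, atol=1e-5))  # True
```

In practice this means a manual merged backward rescales d_out per branch, runs each branch's attention backward to get that branch's dQ (and its dK/dV), and accumulates the dQs with an elementwise add.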