OLMo
Initial GQA implementation, issue with scaled_dot_product_attention
I tried implementing grouped query attention in this PR, but it seems that PyTorch's scaled_dot_product_attention doesn't support the kind of broadcasting we'd need for this: key/value tensors with fewer heads than the query tensor, shared across groups of query heads. Revisit if/when this gets fixed on PyTorch's end.
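
For reference, one possible workaround is to expand the key/value heads explicitly before the call, so that all three tensors have the same number of heads. The sketch below is not the code in this PR. The function name grouped_query_attention and the (batch, heads, seq, head_dim) layout are assumptions made for illustration, and it assumes PyTorch 2.0 or later. It also materializes the repeated keys and values, which gives up part of the memory saving that grouped query attention is meant to provide.

```python
import torch
import torch.nn.functional as F


def grouped_query_attention(q, k, v, is_causal=False):
    """Hypothetical helper, not part of OLMo.

    q:    (batch, n_heads, seq_len, head_dim)
    k, v: (batch, n_kv_heads, seq_len, head_dim), with n_heads % n_kv_heads == 0
    """
    n_heads, n_kv_heads = q.shape[1], k.shape[1]
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    group_size = n_heads // n_kv_heads

    # Repeat each key/value head group_size times along the head dimension,
    # so query head h attends with key/value head h // group_size.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)

    # All three tensors now share the same head count, so no broadcasting
    # across heads is required inside scaled_dot_product_attention.
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
```

With 8 query heads and 2 key/value heads, for example, each key/value head is repeated 4 times and the output has the same shape as q.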