turboderp
# What does this PR do?

- Reverses the order of global and sliding attention layers in Gemma2. This brings it in line with [Google's implementation](https://github.com/google/gemma_pytorch/blob/1814f8d0a6ba93b875c46a64e6ad1873df448eef/gemma/config.py#L118), in which sliding attention...
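For illustration, here is a minimal sketch (not the transformers code itself) of the layer pattern after the reversal, assuming, per the linked config, that sliding-window attention lands on even layer indices and global attention on odd ones:

```python
# Toy illustration only -- not the transformers implementation.
# Assumption: after the reversal, even layer indices use sliding-window
# attention and odd ones use global attention, mirroring Google's
# [LOCAL_SLIDING, GLOBAL] pattern repeated over the layer stack.

def gemma2_attention_types(num_layers: int) -> list[str]:
    return ["sliding" if i % 2 == 0 else "global" for i in range(num_layers)]

print(gemma2_attention_types(6))
# ['sliding', 'global', 'sliding', 'global', 'sliding', 'global']
```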
The `mha_fwd_kvcache` function contains this GQA optimization that triggers whenever `seqlen_q` is 1, with a few other conditions:

```c++
// Faster to transpose q from (b, 1, (nheads_kv ngroups), d)...
```
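To make the trick concrete, here is a small PyTorch sketch of the reshape that comment describes, using made-up shapes (batch 2, 8 KV heads, 4 query heads per KV head, head dim 128):

```python
import torch

# Made-up shapes for illustration: batch 2, 8 KV heads, 4 query heads per
# KV head (32 query heads total), head dim 128, and seqlen_q == 1 (decode).
b, nheads_kv, ngroups, d = 2, 8, 4, 128
q = torch.randn(b, 1, nheads_kv * ngroups, d)

# Fold the GQA groups into the (singleton) query-length dimension, so the
# kernel effectively sees seqlen_q = ngroups and nheads = nheads_kv, i.e. an
# MHA-shaped problem with more work per KV head for the single-token case.
q_swapped = q.reshape(b, nheads_kv, ngroups, d).transpose(1, 2)

print(q_swapped.shape)  # torch.Size([2, 4, 8, 128])
```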
Are there any plans to add more options for `pos_encoding_mode`? Currently `"LLAMA"` works for Llama 3.1+ models, but the embeddings are subtly incorrect and accuracy suffers a bit.
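For reference, Llama 3.1 does not use plain RoPE: it rescales the inverse frequencies before applying the rotation, which is presumably why the plain mode is close but not exact. A minimal sketch of that rescaling (the function name is mine; the defaults are the published Llama 3.1 scaling parameters):

```python
import math

def llama31_scale_inv_freq(inv_freq, factor=8.0, low_freq_factor=1.0,
                           high_freq_factor=4.0, old_context_len=8192):
    """Rescale RoPE inverse frequencies the way Llama 3.1 does.

    High-frequency (short-wavelength) components are kept as-is,
    low-frequency components are divided by `factor`, and the band in
    between is linearly interpolated between the two.
    """
    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    scaled = []
    for freq in inv_freq:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:      # high frequency: unchanged
            scaled.append(freq)
        elif wavelen > low_freq_wavelen:     # low frequency: scaled down
            scaled.append(freq / factor)
        else:                                # smooth transition band
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            scaled.append((1 - smooth) * freq / factor + smooth * freq)
    return scaled
```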