equinox
Lots of improvements to attention
- Support for autoregressive attention.
  - This includes support for zero-length queries, e.g. when populating the caches for the prompt.
- Causal masking is available by passing `mask="causal"`.
- Support for multi-query attention.
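For readers unfamiliar with the terms, here is a minimal sketch of causal masking combined with multi-query attention in plain JAX. It is not the code in this PR: `causal_mask` and `multi_query_attention` are hypothetical helper names, and the array layouts are an assumption made for illustration.

```python
import jax
import jax.numpy as jnp


def causal_mask(query_len, kv_len):
    # Hypothetical helper. True where query i may attend to key j.
    # Queries are aligned to the end of the key sequence, so this also
    # covers kv_len > query_len (e.g. keys already held in a cache).
    q_pos = jnp.arange(query_len)[:, None] + (kv_len - query_len)
    k_pos = jnp.arange(kv_len)[None, :]
    return k_pos <= q_pos


def multi_query_attention(q, k, v, mask=None):
    # Assumed layout: q is (q_len, num_heads, d); k and v are (kv_len, d),
    # i.e. a single key/value head shared by every query head.
    scale = q.shape[-1] ** 0.5
    logits = jnp.einsum("qhd,kd->hqk", q, k) / scale
    if mask is not None:
        # Masked-out positions get the most negative finite value,
        # so they receive zero weight after the softmax.
        logits = jnp.where(mask[None, :, :], logits, jnp.finfo(logits.dtype).min)
    weights = jax.nn.softmax(logits, axis=-1)
    return jnp.einsum("hqk,kd->qhd", weights, v)
```

With this layout a zero-length query (`q.shape[0] == 0`) simply produces an output of shape `(0, num_heads, d)`, which is roughly the situation described above when a prompt only populates the caches.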
Still TODO:
- Support biases, not just masks.
- Interpolate between multi-head attention (MHA) and multi-query attention (MQA).
- Have KV caching not push elements backwards at the end.
- ~Cast softmax to float32.~ [Done elsewhere!]
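As a rough illustration of the first two TODO items, one possible reading of "interpolating" between MHA and MQA is to let groups of query heads share a key/value head, and to accept an additive bias instead of a boolean mask. The sketch below assumes that reading; `grouped_query_attention`, its arguments, and the array layouts are hypothetical and not part of this PR.

```python
import jax
import jax.numpy as jnp


def grouped_query_attention(q, k, v, bias=None):
    # Assumed layout: q is (q_len, num_heads, d); k, v are (kv_len, num_kv_heads, d).
    # num_kv_heads == num_heads behaves like ordinary multi-head attention;
    # num_kv_heads == 1 behaves like multi-query attention.
    q_len, num_heads, d = q.shape
    kv_len, num_kv_heads, _ = k.shape
    assert num_heads % num_kv_heads == 0
    group = num_heads // num_kv_heads
    # Split the query heads into groups that each share one key/value head.
    q = q.reshape(q_len, num_kv_heads, group, d)
    logits = jnp.einsum("qngd,knd->ngqk", q, k) / (d ** 0.5)
    if bias is not None:
        # Additive bias of shape (q_len, kv_len), broadcast over heads.
        # A boolean mask is the special case of 0 where attention is allowed
        # and a very large negative value elsewhere.
        logits = logits + bias
    weights = jax.nn.softmax(logits, axis=-1)
    out = jnp.einsum("ngqk,knd->qngd", weights, v)
    return out.reshape(q_len, num_heads, d)
```

Here query head `h` reads key/value head `h // group`, so the number of key/value heads is the single knob that moves between the two extremes.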