Andy Arditi
Having written out the alternative solution (a cached attention mask that grows as needed), I'm thinking maybe that's better. It does have the following drawback: if you run...
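For concreteness, here is a minimal sketch of what I mean by a cached mask that grows as needed (illustrative only, not the code written out above):

```python
# Minimal sketch (assumption: a plain causal mask, not the PR's actual code).
# Keep one lower-triangular mask cached and only rebuild it when a longer
# sequence comes in; otherwise slice the cached mask down to size.
import torch

class GrowingCausalMask:
    def __init__(self, initial_size: int = 128):
        # True = this key position may be attended to by the query position.
        self.mask = torch.tril(torch.ones(initial_size, initial_size, dtype=torch.bool))

    def get(self, seq_len: int) -> torch.Tensor:
        if seq_len > self.mask.shape[0]:
            # Grow the cache: rebuild the mask at the new, larger size.
            self.mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
        return self.mask[:seq_len, :seq_len]
```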
I ran the following benchmarks to measure the perf impact. The difference doesn't seem significant to me, so I think this simple implementation is ok. Let me know if...
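For reference, a rough sketch of the kind of timing harness one could use to measure the overhead of building the mask each call (hypothetical, not the exact benchmark referenced above):

```python
# Hypothetical micro-benchmark sketch: times repeated on-the-fly construction
# of a causal mask, which is the extra work the on-the-fly approach adds.
import time
import torch

def time_mask_construction(seq_len: int, n_iters: int = 1000) -> float:
    start = time.perf_counter()
    for _ in range(n_iters):
        torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return (time.perf_counter() - start) / n_iters

print(f"avg per-call: {time_mask_construction(1024) * 1e6:.1f} µs")
```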
Hi @bryce13950 - thanks for pinging on this. The currently implemented solution in this PR is to construct attention masks on-the-fly for each attention component (i.e. at each layer). This solution...
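For illustration, a minimal sketch of the on-the-fly approach (not the exact implementation in this PR): each attention component builds a causal mask from the score tensor's shape at forward time, on the same device as the scores.

```python
# Minimal sketch (assumed shapes: attn_scores is [batch, n_heads, q_pos, k_pos]).
import torch

def apply_causal_mask(attn_scores: torch.Tensor) -> torch.Tensor:
    q_len, k_len = attn_scores.shape[-2], attn_scores.shape[-1]
    # Build the mask fresh on every call; the diagonal offset handles the case
    # where cached keys make k_len longer than q_len.
    mask = torch.triu(
        torch.ones(q_len, k_len, dtype=torch.bool, device=attn_scores.device),
        diagonal=1 + (k_len - q_len),
    )
    return attn_scores.masked_fill(mask, torch.finfo(attn_scores.dtype).min)
```

The trade-off versus the cached mask is a small amount of repeated work per layer in exchange for not having to manage any growing state.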
Abandoning this PR.
Just noting here that Yi models (both [6B](https://huggingface.co/01-ai/Yi-6B/blob/7ed3ea6ea9c05020e2fd0cd1cc2916921a369d7c/config.json#L15) and [34B](https://huggingface.co/01-ai/Yi-34B/blob/48ef127f218826a38e0dc0aebea9505e8302a842/config.json#L15)) use grouped-query attention (`num_key_value_heads` < `num_attention_heads`). Grouped-query attention is implemented in #443, so this integration should be straightforward once that...
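For context, a minimal sketch of grouped-query attention (illustrative only, not the #443 implementation): each K/V head is shared by a group of query heads, e.g. by repeating K/V along the head dimension so the shapes line up.

```python
# Minimal sketch of grouped-query attention (no causal mask, for brevity).
# q: [batch, n_q_heads, seq, d_head]; k, v: [batch, n_kv_heads, seq, d_head],
# with n_kv_heads < n_q_heads (num_key_value_heads < num_attention_heads).
import torch

def grouped_query_attention(q, k, v):
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_q_heads // n_kv_heads
    # Repeat each K/V head so it pairs with its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```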