sparseml
[Experimental][StarCoder] KV Cache Injection
Feature Description
The results of my experimentation with the tiny_starcoder model.
Findings:
- The original KV cache is not added as separate arrays (past_key_values.{attn_block_id}.values and past_key_values.{attn_block_id}.keys), but as a single joined array of keys and values. I did not get to look into splitting the two, but from analyzing the ONNX graph I do not see why we could not do it.
- The causal mask for this model has different dimensions than what we usually assume. This could be patched by adding a node after the causal_mask input that applies the appropriate permutation to the input.
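To make the two findings concrete, here is a minimal NumPy sketch of the tensor operations involved. It does not touch the ONNX graph or any sparseml code. The function names, the toy shapes, the keys-then-values ordering along the last axis, and the (0, 2, 1, 3) permutation are all assumptions for illustration; the real layouts would need to be read off the exported graph.

```python
import numpy as np


def split_joint_kv(joint_kv, head_dim):
    """Split a joined KV array into separate key and value arrays.

    ASSUMPTION: keys and values are concatenated along the last axis,
    keys first, so the last axis has size 2 * head_dim.
    """
    if joint_kv.shape[-1] != 2 * head_dim:
        raise ValueError(
            f"expected last axis of size {2 * head_dim}, got {joint_kv.shape[-1]}"
        )
    # Two equal halves along the last axis: first half keys, second half values
    keys, values = np.split(joint_kv, 2, axis=-1)
    return keys, values


def permute_causal_mask(mask, perm):
    """Reorder the mask axes, as an inserted Transpose-style node would."""
    return np.transpose(mask, perm)


# Toy shapes, chosen only for illustration
batch, seq_len, head_dim = 1, 4, 8
joint_kv = np.arange(
    batch * seq_len * 2 * head_dim, dtype=np.float32
).reshape(batch, seq_len, 2 * head_dim)
keys, values = split_joint_kv(joint_kv, head_dim)

# Lower-triangular causal mask in an assumed (batch, 1, query, key) layout
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))[None, None]
# ASSUMPTION: the model expects (batch, query, 1, key); the real permutation
# has to be derived from the model's graph
patched_mask = permute_causal_mask(mask, (0, 2, 1, 3))
```

In the graph itself, the split would presumably map to a Split or Slice node and the mask patch to a Transpose node placed right after the causal_mask input, but that wiring is not shown here.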
This is an experimental branch. I am pausing development for now due to other priorities, and it can be revisited in the future.