[KVCache] Support passing in attn_score_scaling_factor into KV cache
In GPT-2, the attention calculation uses an additional feature, `scale_attn_by_inverse_layer_idx`. It applies a per-layer scaling factor to the attention scores before the softmax is taken.
This PR adds support for this additional parameter in the KV cache.
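For reference, here is a minimal NumPy sketch (illustrative only, not the TVM KV cache API) of what this scaling means in GPT-2 style attention: on top of the usual `1/sqrt(d)` scale, an extra per-layer factor `1/(layer_idx + 1)` is multiplied into the scores before softmax; that extra factor is the `attn_score_scaling_factor` this PR threads through the KV cache.

```python
# Minimal NumPy sketch; names are illustrative, not the TVM KV cache API.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, layer_idx, scale_attn_by_inverse_layer_idx=True):
    """Single-head attention with GPT-2's per-layer score scaling.

    q: (seq_q, d), k: (seq_k, d), v: (seq_k, d)
    """
    d = q.shape[-1]
    # Extra per-layer factor applied to the scores before softmax;
    # this is the attn_score_scaling_factor passed into the KV cache.
    attn_score_scaling_factor = (
        1.0 / (layer_idx + 1) if scale_attn_by_inverse_layer_idx else 1.0
    )
    scores = (q @ k.T) / np.sqrt(d) * attn_score_scaling_factor
    return softmax(scores, axis=-1) @ v
```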
cc @MasterJH5574. This will need https://github.com/flashinfer-ai/flashinfer/pull/126 to be merged first.
Given that https://github.com/flashinfer-ai/flashinfer/pull/126 has been merged, let's bump 3rdparty/flashinfer to the latest FlashInfer.
Also, there is a formatting issue reported by CI: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm-lint/detail/PR-16606/3/pipeline