[KVCache] Support passing in attn_score_scaling_factor into KV cache
In GPT-2, the attention calculation uses an additional feature, `scale_attn_by_inverse_layer_idx`. It applies a per-layer scaling factor to the attention scores before the softmax is taken.
This PR adds support for this additional parameter in the KV cache.
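For reference, here is a minimal NumPy sketch (illustrative only, not the TVM KV cache API) of what this scaling means in GPT-2 style attention: on top of the usual `1/sqrt(d)` scale, an extra per-layer factor `1/(layer_idx + 1)` is multiplied into the scores before softmax; that extra factor is the `attn_score_scaling_factor` this PR threads through the KV cache.

```python
# Minimal NumPy sketch; names are illustrative, not the TVM KV cache API.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, layer_idx, scale_attn_by_inverse_layer_idx=True):
    """Single-head attention with GPT-2's per-layer score scaling.

    q: (seq_q, d), k: (seq_k, d), v: (seq_k, d)
    """
    d = q.shape[-1]
    # Extra per-layer factor applied to the scores before softmax;
    # this is the attn_score_scaling_factor passed into the KV cache.
    attn_score_scaling_factor = (
        1.0 / (layer_idx + 1) if scale_attn_by_inverse_layer_idx else 1.0
    )
    scores = (q @ k.T) / np.sqrt(d) * attn_score_scaling_factor
    return softmax(scores, axis=-1) @ v
```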
cc @MasterJH5574. This will need https://github.com/flashinfer-ai/flashinfer/pull/126 to be merged first.
Given that https://github.com/flashinfer-ai/flashinfer/pull/126 has been merged, let's bump 3rdparty/flashinfer to the latest FlashInfer.
Also, there is a formatting issue reported by CI: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm-lint/detail/PR-16606/3/pipeline