TensorRT-LLM
int8_kv_cache scale
https://github.com/NVIDIA/TensorRT-LLM/blob/89ba1b1a67d570e41b03da87e5518eaff0d31fbf/tensorrt_llm/models/llama/convert.py#L757
I'm puzzled as to why the act_range of q_proj is included when computing the scale for int8_kv_cache. The scale is only used to quantize the outputs of k_proj and v_proj, so why should q_proj's activation range affect it?
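To make the question concrete, here is a minimal sketch of what I would expect the scale computation to look like if it depended only on the tensors actually stored in the KV cache. The function names and shapes are hypothetical, not the actual convert.py code; the scale maps the observed absolute maximum of the k_proj/v_proj outputs onto the int8 range:

```python
import numpy as np

def int8_kv_cache_scale(k_act_range: np.ndarray, v_act_range: np.ndarray) -> float:
    """Hypothetical scale for int8 KV-cache quantization.

    Only the outputs of k_proj and v_proj are written to the KV cache,
    so intuitively the scale should depend only on their activation
    ranges (max absolute values), not on q_proj's.
    """
    amax = max(np.abs(k_act_range).max(), np.abs(v_act_range).max())
    return float(amax) / 127.0  # map [-amax, amax] onto int8 [-127, 127]

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Symmetric per-tensor int8 quantization with the scale above."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Under this reading, mixing q_proj's act_range into `amax` could only widen the scale and waste int8 resolution on values that never enter the cache, which is why the inclusion surprises me.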
Reassigning to @thorjohnsen