TensorRT-LLM
int8_kv_cache scale
https://github.com/NVIDIA/TensorRT-LLM/blob/89ba1b1a67d570e41b03da87e5518eaff0d31fbf/tensorrt_llm/models/llama/convert.py#L757
I'm puzzled as to why the act_range of q_proj is included when computing the scale for int8_kv_cache. The scale is only used to quantize the outputs of k_proj and v_proj, so why should q_proj's activation range affect it?
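To make the question concrete, here is a minimal sketch of what I would expect the scale computation to look like if it depended only on the tensors actually stored in the KV cache. The function names and shapes are hypothetical, not the actual convert.py code; the scale maps the observed absolute maximum of the k_proj/v_proj outputs onto the int8 range:

```python
import numpy as np

def int8_kv_cache_scale(k_act_range: np.ndarray, v_act_range: np.ndarray) -> float:
    """Hypothetical scale for int8 KV-cache quantization.

    Only the outputs of k_proj and v_proj are written to the KV cache,
    so intuitively the scale should depend only on their activation
    ranges (max absolute values), not on q_proj's.
    """
    amax = max(np.abs(k_act_range).max(), np.abs(v_act_range).max())
    return float(amax) / 127.0  # map [-amax, amax] onto int8 [-127, 127]

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Symmetric per-tensor int8 quantization with the scale above."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Under this reading, mixing q_proj's act_range into `amax` could only widen the scale and waste int8 resolution on values that never enter the cache, which is why the inclusion surprises me.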
Reassigning to @thorjohnsen