Quanfeng Li

Results: 2 issues by Quanfeng Li

https://github.com/NVIDIA/TensorRT-LLM/blob/89ba1b1a67d570e41b03da87e5518eaff0d31fbf/tensorrt_llm/models/llama/convert.py#L757 I'm puzzled as to why the `act_range` of `q_proj` is included when computing the scale for int8_kv_cache, since that scale is only used to quantize the output of `k_proj`...

Labels: triaged, Investigating, KV-Cache Management
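For context on the question above, here is a minimal sketch of symmetric int8 scale derivation from calibration activation ranges. The names (`int8_kv_scale`, `quantize_kv`, the `act_ranges` list) are illustrative assumptions, not the actual code in `convert.py`; the point is that folding `q_proj`'s range into the max can only loosen the scale applied to `k_proj`/`v_proj` outputs.

```python
import numpy as np

def int8_kv_scale(act_ranges):
    # Symmetric int8 quantization: pick one scale that maps the largest
    # observed |activation| among the listed projections onto 127.
    # If q_proj's range is included here but only k/v outputs are
    # quantized, the scale is needlessly inflated whenever q_proj's
    # range dominates, wasting int8 resolution.
    amax = max(abs(r) for r in act_ranges)
    return amax / 127.0

def quantize_kv(x, scale):
    # Quantize activations to int8 using the shared per-tensor scale.
    return np.clip(np.round(np.asarray(x) / scale), -128, 127).astype(np.int8)
```

Usage: `int8_kv_scale([kv_amax])` versus `int8_kv_scale([q_amax, kv_amax])` makes the inflation concrete when `q_amax > kv_amax`.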

The `WeightOnlyQuantRowLinear` module was missing the `is_expert` parameter, which caused MoE models like Deepseek 2/3 and Mixtral to perform unnecessary `allreduce` operations during INT8 weight-only quantization. This issue resulted in...

Labels: triaged, Community want to contribute
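To illustrate why a missing `is_expert` flag leads to redundant `allreduce` calls, here is a hedged sketch of the tensor-parallel row-linear pattern. The class and helper names (`RowLinear`, `all_reduce`) and the list-of-shards simulation are assumptions for illustration, not TensorRT-LLM's actual `WeightOnlyQuantRowLinear` API.

```python
import numpy as np

def all_reduce(partials):
    # Simulate an allreduce across tensor-parallel ranks by summing
    # each rank's partial output.
    return sum(partials)

class RowLinear:
    """Row-parallel linear layer: the weight is sharded along the
    input dimension, so each rank produces a partial sum."""

    def __init__(self, weight_shards, is_expert=False):
        self.weight_shards = weight_shards
        self.is_expert = is_expert

    def forward(self, x_shards):
        # Each rank multiplies its input shard by its weight shard.
        partials = [x @ w for x, w in zip(x_shards, self.weight_shards)]
        if not self.is_expert:
            # Dense layers need the cross-rank sum right away.
            return all_reduce(partials)
        # Expert layers defer the reduction until after expert outputs
        # are combined, so reducing here would be one extra allreduce
        # per expert -- the overhead the issue describes.
        return partials
```

Without the `is_expert` parameter, every expert's projection would take the dense path and pay an `allreduce` per expert per layer, which is exactly the unnecessary communication reported for MoE models.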