Quanfeng Li

Results: 2 issues by Quanfeng Li

https://github.com/NVIDIA/TensorRT-LLM/blob/89ba1b1a67d570e41b03da87e5518eaff0d31fbf/tensorrt_llm/models/llama/convert.py#L757 I'm puzzled as to why the `act_range` of `q_proj` is included when computing the scale for int8_kv_cache, since that scale is only used to quantize the output of `k_proj`...

Labels: triaged, Investigating, KV-Cache Management
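For context on the question above, here is a minimal sketch of symmetric int8 scale derivation from calibration activation ranges. The names (`int8_kv_scale`, `quantize_kv`, the `act_ranges` list) are illustrative assumptions, not the actual code in `convert.py`; the point is that folding `q_proj`'s range into the max can only loosen the scale applied to `k_proj`/`v_proj` outputs.

```python
import numpy as np

def int8_kv_scale(act_ranges):
    # Symmetric int8 quantization: pick one scale that maps the largest
    # observed |activation| among the listed projections onto 127.
    # If q_proj's range is included here but only k/v outputs are
    # quantized, the scale is needlessly inflated whenever q_proj's
    # range dominates, wasting int8 resolution.
    amax = max(abs(r) for r in act_ranges)
    return amax / 127.0

def quantize_kv(x, scale):
    # Quantize activations to int8 using the shared per-tensor scale.
    return np.clip(np.round(np.asarray(x) / scale), -128, 127).astype(np.int8)
```

Usage: `int8_kv_scale([kv_amax])` versus `int8_kv_scale([q_amax, kv_amax])` makes the inflation concrete when `q_amax > kv_amax`.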

The `WeightOnlyQuantRowLinear` module was missing the `is_expert` parameter, which caused MoE models like Deepseek 2/3 and Mixtral to perform unnecessary `allreduce` operations during INT8 weight-only quantization. This issue resulted in...

Labels: triaged, Community want to contribute
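To illustrate why a missing `is_expert` flag leads to redundant `allreduce` calls, here is a hedged sketch of the tensor-parallel row-linear pattern. The class and helper names (`RowLinear`, `all_reduce`) and the list-of-shards simulation are assumptions for illustration, not TensorRT-LLM's actual `WeightOnlyQuantRowLinear` API.

```python
import numpy as np

def all_reduce(partials):
    # Simulate an allreduce across tensor-parallel ranks by summing
    # each rank's partial output.
    return sum(partials)

class RowLinear:
    """Row-parallel linear layer: the weight is sharded along the
    input dimension, so each rank produces a partial sum."""

    def __init__(self, weight_shards, is_expert=False):
        self.weight_shards = weight_shards
        self.is_expert = is_expert

    def forward(self, x_shards):
        # Each rank multiplies its input shard by its weight shard.
        partials = [x @ w for x, w in zip(x_shards, self.weight_shards)]
        if not self.is_expert:
            # Dense layers need the cross-rank sum right away.
            return all_reduce(partials)
        # Expert layers defer the reduction until after expert outputs
        # are combined, so reducing here would be one extra allreduce
        # per expert -- the overhead the issue describes.
        return partials
```

Without the `is_expert` parameter, every expert's projection would take the dense path and pay an `allreduce` per expert per layer, which is exactly the unnecessary communication reported for MoE models.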