cccpr
@Tracin llama2-7b with int8 weight-only + int8 kv-cache: bad accuracy. int8 weight-only alone: good accuracy.
@Tracin K and V with separate scales (per-tensor, static): accuracy is good. K and V with merged scales: I will test this case later today.
@Tracin Why does TensorRT-LLM have to merge QKV?
> @Tracin K and V with separate scales (per-tensor, static): accuracy is good. K and V with merged scales: I will test this case later today.

K and V with separate scales (per-tensor, static): accuracy is fine. K and V with merged scales: acc...
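To make the separate-vs-merged distinction concrete, here is a minimal sketch of per-tensor static INT8 scales for the KV cache (the shapes and calibration stats below are illustrative assumptions, not TensorRT-LLM internals):

```python
import torch

def per_tensor_scale(t: torch.Tensor) -> float:
    # static per-tensor scale: amax observed during calibration / int8 range
    return t.abs().max().item() / 127.0

# calibration-time K and V activations (illustrative shapes and ranges)
k = torch.randn(4, 32, 128) * 0.5   # keys often have a different range
v = torch.randn(4, 32, 128) * 2.0   # than values

# separate scales: each tensor uses its own dynamic range
k_scale, v_scale = per_tensor_scale(k), per_tensor_scale(v)

# merged scale: one scale shared by K and V (the larger of the two ranges);
# the tensor with the smaller range loses precision after rounding to int8
kv_scale = max(k_scale, v_scale)

def int8_roundtrip(t, scale):
    return (t / scale).round().clamp(-127, 127) * scale

print("K error, separate:", (int8_roundtrip(k, k_scale) - k).abs().mean().item())
print("K error, merged:  ", (int8_roundtrip(k, kv_scale) - k).abs().mean().item())
```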
> > @Tracin why does TensorRT-LLM have to merge QKV?
>
> Launching a larger GEMM can be more efficient than launching three small kernels.

BTW, in the smoothquant implemented by...
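As a rough illustration of the fused-QKV point (plain PyTorch with an assumed `x @ W` layout, not TensorRT-LLM code): concatenating the three projection weights lets one larger GEMM replace three kernel launches, which keeps the GPU busier and amortizes launch overhead.

```python
import torch

hidden = 1024
x = torch.randn(8, hidden)                       # [num_tokens, hidden]

# three separate projections -> three GEMM launches
wq, wk, wv = (torch.randn(hidden, hidden) for _ in range(3))
q, k, v = x @ wq, x @ wk, x @ wv

# fused projection -> one larger GEMM launch, then a cheap split
w_qkv = torch.cat([wq, wk, wv], dim=1)           # [hidden, 3 * hidden]
q2, k2, v2 = (x @ w_qkv).split(hidden, dim=1)

assert torch.allclose(q, q2, atol=1e-3)
assert torch.allclose(k, k2, atol=1e-3)
assert torch.allclose(v, v2, atol=1e-3)
```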
@Tracin just use the official example of smoothquant like this:

1. `python hf_llama_convert.py -i /root/models/Llama-2-7b/ -o ./smooth_llama2_7b_alpha_0.5/sq0.5/ -sq 0.5 --tensor-parallelism 1 --storage-type fp16`
2. `python build.py --bin_model_dir /root/TensorRT-LLM/examples/llama/smooth_llama2_7b_alpha_0.5/sq0.5/1-gpu/ --use_gpt_attention_plugin float16` ...
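For readers wondering what `-sq 0.5` controls: it is SmoothQuant's migration strength alpha. Below is a hedged sketch of the per-channel smoothing formula from the SmoothQuant paper, not the converter's actual code; the shapes and calibration stats are assumptions.

```python
import torch

def smoothquant_scales(act_absmax: torch.Tensor, w: torch.Tensor, alpha: float = 0.5):
    """s_j = max|X_j|**alpha / max|W_j|**(1 - alpha), per input channel j.
    Activations are divided by s and weights are multiplied by s, so the
    product x @ w is unchanged while activation outliers are flattened."""
    w_absmax = w.abs().amax(dim=1)                  # w is [in_features, out_features]
    return act_absmax.pow(alpha) / w_absmax.pow(1.0 - alpha)

# illustrative calibration stats: a few activation channels with large outliers
act_absmax = torch.rand(1024) * 20 + 1e-3
w = torch.randn(1024, 4096) * 0.05
s = smoothquant_scales(act_absmax, w, alpha=0.5)

# at inference: y = (x / s) @ (s[:, None] * w) equals x @ w, but (x / s)
# now has a much flatter range and quantizes better to INT8 per tensor
```

With alpha = 0.5 the quantization difficulty is split evenly between activations and weights, which is the value the example command above uses.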
@Tracin I have tested int8-kv-cache and smoothquant w8a8 separately on **Llama-1-7b**; both got good accuracy (close to the fp16 accuracy, about **35.5 on MMLU**), just like what you have...
@Tracin Is the bug fixed?
@Tracin Get the bin model: `python hf_llama_convert.py -i /root/models/Llama-2-7b/ -o /root/TensorRT-LLM/examples/llama/llama2_7b_w8_int8_kv_cache/ --calibrate-kv-cache -t fp16`. I use the bin files generated by the command above to build a **weight-only-quantize** trt-engine, like this:...
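For contrast with the W8A8/SmoothQuant path discussed above, here is a rough sketch of what INT8 weight-only quantization does (illustrative per-output-column scales, not the TensorRT-LLM implementation): only the weights are stored in INT8 and activations stay in fp16, which is why weight-only needs no activation calibration, while the INT8 KV cache path built with `--calibrate-kv-cache` does.

```python
import torch

def int8_weight_only(w: torch.Tensor):
    """Per-output-channel INT8 weight-only quantization (sketch):
    weights become int8 plus one fp scale per output column; the GEMM
    dequantizes on the fly and keeps activations in fp16."""
    scale = w.abs().amax(dim=0) / 127.0             # [out_features]
    w_int8 = (w / scale).round().clamp(-127, 127).to(torch.int8)
    return w_int8, scale

w_fp16 = torch.randn(4096, 4096, dtype=torch.float16)
w_int8, scale = int8_weight_only(w_fp16.float())
w_dequant = w_int8.float() * scale
print("mean abs quantization error:", (w_dequant - w_fp16.float()).abs().mean().item())
```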
@kaiyux @Shixiaowei02 I noticed that the Llama2-70B INT8-SmoothQuant accuracy drop is only about 2%, as described [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md#accuracy) in the official docs. Given the discussions in this issue about Llama2-7B, the acc...