cccpr
@Tracin llama2-7b with int8 weight-only + int8 kv-cache: bad accuracy. int8 weight-only alone: good accuracy.
@Tracin K and V with separate scales (per-tensor, static): accuracy is good. K and V with merged scales: I will test this case later today.
@Tracin Why does TensorRT-LLM have to merge QKV?
> @Tracin K and V with separate scales (per-tensor, static): accuracy is good. K and V with merged scales: I will test this case later today.

K and V with separate scales (per-tensor, static): accuracy is fine. K and V with merged scales: acc...
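To make the separate-vs-merged distinction concrete, here is a minimal sketch of per-tensor static INT8 scales for the KV cache (the shapes and calibration stats below are illustrative assumptions, not TensorRT-LLM internals):

```python
import torch

def per_tensor_scale(t: torch.Tensor) -> float:
    # static per-tensor scale: amax observed during calibration / int8 range
    return t.abs().max().item() / 127.0

# calibration-time K and V activations (illustrative shapes and ranges)
k = torch.randn(4, 32, 128) * 0.5   # keys often have a different range
v = torch.randn(4, 32, 128) * 2.0   # than values

# separate scales: each tensor uses its own dynamic range
k_scale, v_scale = per_tensor_scale(k), per_tensor_scale(v)

# merged scale: one scale shared by K and V (the larger of the two ranges);
# the tensor with the smaller range loses precision after rounding to int8
kv_scale = max(k_scale, v_scale)

def int8_roundtrip(t, scale):
    return (t / scale).round().clamp(-127, 127) * scale

print("K error, separate:", (int8_roundtrip(k, k_scale) - k).abs().mean().item())
print("K error, merged:  ", (int8_roundtrip(k, kv_scale) - k).abs().mean().item())
```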
> > @Tracin why does TensorRT-LLM have to merge QKV?
>
> Launching a larger GEMM can be more efficient than launching three small kernels.

BTW, in the smoothquant implemented by...
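As a rough illustration of the fused-QKV point (plain PyTorch with an assumed `x @ W` layout, not TensorRT-LLM code): concatenating the three projection weights lets one larger GEMM replace three kernel launches, which keeps the GPU busier and amortizes launch overhead.

```python
import torch

hidden = 1024
x = torch.randn(8, hidden)                       # [num_tokens, hidden]

# three separate projections -> three GEMM launches
wq, wk, wv = (torch.randn(hidden, hidden) for _ in range(3))
q, k, v = x @ wq, x @ wk, x @ wv

# fused projection -> one larger GEMM launch, then a cheap split
w_qkv = torch.cat([wq, wk, wv], dim=1)           # [hidden, 3 * hidden]
q2, k2, v2 = (x @ w_qkv).split(hidden, dim=1)

assert torch.allclose(q, q2, atol=1e-3)
assert torch.allclose(k, k2, atol=1e-3)
assert torch.allclose(v, v2, atol=1e-3)
```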
@Tracin just use the official example of smoothquant like this:

1. `python hf_llama_convert.py -i /root/models/Llama-2-7b/ -o ./smooth_llama2_7b_alpha_0.5/sq0.5/ -sq 0.5 --tensor-parallelism 1 --storage-type fp16`
2. `python build.py --bin_model_dir /root/TensorRT-LLM/examples/llama/smooth_llama2_7b_alpha_0.5/sq0.5/1-gpu/ --use_gpt_attention_plugin float16` ...
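For readers wondering what `-sq 0.5` controls: it is SmoothQuant's migration strength alpha. Below is a hedged sketch of the per-channel smoothing formula from the SmoothQuant paper, not the converter's actual code; the shapes and calibration stats are assumptions.

```python
import torch

def smoothquant_scales(act_absmax: torch.Tensor, w: torch.Tensor, alpha: float = 0.5):
    """s_j = max|X_j|**alpha / max|W_j|**(1 - alpha), per input channel j.
    Activations are divided by s and weights are multiplied by s, so the
    product x @ w is unchanged while activation outliers are flattened."""
    w_absmax = w.abs().amax(dim=1)                  # w is [in_features, out_features]
    return act_absmax.pow(alpha) / w_absmax.pow(1.0 - alpha)

# illustrative calibration stats: a few activation channels with large outliers
act_absmax = torch.rand(1024) * 20 + 1e-3
w = torch.randn(1024, 4096) * 0.05
s = smoothquant_scales(act_absmax, w, alpha=0.5)

# at inference: y = (x / s) @ (s[:, None] * w) equals x @ w, but (x / s)
# now has a much flatter range and quantizes better to INT8 per tensor
```

With alpha = 0.5 the quantization difficulty is split evenly between activations and weights, which is the value the example command above uses.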
@Tracin I have tested int8-kv-cache and smoothquant w8a8 separately on **Llama-1-7b**; both got good accuracy (close to the fp16 accuracy, about **35.5 on MMLU**), just like what you have...
@Tracin Is the bug fixed?
@Tracin Get the bin model: `python hf_llama_convert.py -i /root/models/Llama-2-7b/ -o /root/TensorRT-LLM/examples/llama/llama2_7b_w8_int8_kv_cache/ --calibrate-kv-cache -t fp16`. I use the bin files generated by the command above to build a **weight-only-quantize** trt-engine, like this:...
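For contrast with the W8A8/SmoothQuant path discussed above, here is a rough sketch of what INT8 weight-only quantization does (illustrative per-output-column scales, not the TensorRT-LLM implementation): only the weights are stored in INT8 and activations stay in fp16, which is why weight-only needs no activation calibration, while the INT8 KV cache path built with `--calibrate-kv-cache` does.

```python
import torch

def int8_weight_only(w: torch.Tensor):
    """Per-output-channel INT8 weight-only quantization (sketch):
    weights become int8 plus one fp scale per output column; the GEMM
    dequantizes on the fly and keeps activations in fp16."""
    scale = w.abs().amax(dim=0) / 127.0             # [out_features]
    w_int8 = (w / scale).round().clamp(-127, 127).to(torch.int8)
    return w_int8, scale

w_fp16 = torch.randn(4096, 4096, dtype=torch.float16)
w_int8, scale = int8_weight_only(w_fp16.float())
w_dequant = w_int8.float() * scale
print("mean abs quantization error:", (w_dequant - w_fp16.float()).abs().mean().item())
```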
@kaiyux @Shixiaowei02 I noticed that the Llama2-70B INT8-SmoothQuant accuracy drop is only about 2%, as described [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md#accuracy) in the official docs. Given the discussions in this issue about Llama2-7B, the acc...