
8 issue results for wanzhenchn

### Motivation I used the scripts to **benchmark s-lora** with lmdeploy 0.2.6 on 2*A30. First I benchmarked only the base model llama2-13b-hf; the performance of the pytorch backend **is obviously lower**...

Suppose a search for second-hand homes in Chaoyang, Beijing returns 34539 listings in total, but the site only displays 100 pages with 30 records per page, so the crawler can output at most 100*30 = 3000 records. Is there no way to fetch the remaining 34539 - 3000 = 31539 listings?
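A common workaround for this kind of pagination cap is to subdivide the search by a filter (e.g. price range) until every sub-query returns fewer results than the cap, then crawl each sub-range separately. Below is a minimal sketch of that idea; `count_listings` is a hypothetical stand-in for a real request that returns the hit count for a price range, and the uniform `fake_count` distribution is invented purely for illustration.

```python
# The site shows at most 100 pages x 30 records = 3000 results per query,
# so a single query can never cover all 34539 listings.
PAGE_CAP = 100 * 30

def split_ranges(lo, hi, count_listings, cap=PAGE_CAP):
    """Recursively bisect the price interval [lo, hi) into sub-ranges
    whose result counts each fit under the pagination cap."""
    n = count_listings(lo, hi)
    if n <= cap or hi - lo <= 1:
        return [(lo, hi, n)]
    mid = (lo + hi) // 2
    return (split_ranges(lo, mid, count_listings, cap)
            + split_ranges(mid, hi, count_listings, cap))

# Toy stand-in: 34539 listings spread uniformly over a price axis.
def fake_count(lo, hi, total=34539, full=(100, 10000)):
    span = full[1] - full[0]
    return round(total * (min(hi, full[1]) - max(lo, full[0])) / span)

ranges = split_ranges(100, 10000, fake_count)
# Every sub-range now fits under the 3000-record cap and can be
# crawled page by page without losing listings.
assert all(n <= PAGE_CAP for _, _, n in ranges)
```

In a real crawler, `count_listings` would issue a filtered search request and read the reported total; any filter with enough distinct values (price, area, district) works as the splitting axis.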

### Motivation The library https://github.com/mit-han-lab/qserve introduces a W4A8KV4 quantization method, called QoQ in the paper (https://arxiv.org/abs/2405.04532), which **delivers performance gains at large batch sizes** compared to other methods (like AWQ-W4A16). > Quantization...

INT4 quantization only delivers **20%~35%** faster inference than FP16 for LLaMA-13b on a single A100 80GB PCIe with batch sizes 1, 2, 4, 8, 16 for prefill length, decode length...