Yuxuan Xia
This is the current env-check.sh result  
We cannot reproduce this issue. Peak memory for a 2k input is always larger than a 1k input's. If we use Quantized KV Cache, the long sequence's second-token latency might outperform the shorter sequence's....
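For context, KV-cache quantization typically stores the attention keys and values in int8 with small floating-point scales instead of fp16/fp32, which shrinks the cache and can shift the memory/latency trade-off for long sequences. Below is a minimal conceptual sketch of symmetric per-token int8 quantization; the function names and shapes are illustrative, not the library's actual implementation:

```python
# Conceptual sketch of int8 KV-cache quantization (illustrative only,
# not the actual Quantized KV Cache implementation).
import torch

def quantize_kv(x: torch.Tensor):
    """Quantize a (seq_len, num_heads, head_dim) cache tensor to int8
    with one max-abs scale per token (per row along seq_len)."""
    scale = x.abs().amax(dim=(1, 2), keepdim=True) / 127.0
    scale = scale.clamp(min=1e-8)                 # avoid divide-by-zero
    q = torch.round(x / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Dequantize back to float before the attention matmul.
    return q.float() * scale

k = torch.randn(2048, 32, 128)                    # a 2k-token key cache
qk, s = quantize_kv(k)

# int8 storage is ~4x smaller than fp32 (2x smaller than fp16).
print(k.element_size() * k.nelement() / 2**20, "MB fp32")
print(qk.element_size() * qk.nelement() / 2**20, "MB int8 (+ small scales)")
print("max dequant error:", (dequantize_kv(qk, s) - k).abs().max().item())
```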
I think the pretrained full model is provided in the repo, but it is not that obvious. You can check this [link](https://drive.google.com/drive/folders/15wx9vOM0euyizq-M1uINgN0_wjVRf9J3)
We cannot reproduce this issue. In our testing, **W4A16** Baichuan2 7B's peak memory grows with the input sequence length when the max output is 512. | | peak mem (GB) |...
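This growth is expected: even with int4 weights (W4A16), the KV cache holds one key and one value vector per layer per token, so its footprint is linear in sequence length. A back-of-the-envelope estimate, assuming Baichuan2 7B's Llama-like shape (32 layers, 32 heads, head dim 128; these config values are assumptions) and an fp16 cache:

```python
# Rough KV-cache size estimate; model shape values are assumptions
# based on Baichuan2 7B's Llama-like architecture.
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    # 2x for keys and values, one cache entry per layer per token
    return 2 * n_layers * seq_len * n_heads * head_dim * dtype_bytes

# Input lengths of 1k/2k plus the 512 max output tokens:
for tokens in (1024 + 512, 2048 + 512):
    print(f"{tokens:>5} tokens -> {kv_cache_bytes(tokens) / 2**30:.2f} GiB KV cache")
```

With these assumed shapes, a 2k input reserves roughly twice the cache of a 1k input, which matches peak memory growing with the input sequence.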
> Hello [@FrankLeeeee](https://github.com/FrankLeeeee) , yes, I would like to take this task and I will send out the PR later. Thank you! Hi Zixuan, may I ask what you are...
> Hello [@NovTi](https://github.com/NovTi) , I reviewed your PR, and it shouldn't have any conflicts with yours~ I just did an improvement on the grouped_topk logic. Cooooool