Yuxuan Xia

6 comments by Yuxuan Xia

This is the current `env-check.sh` result: ![env-check2](https://github.com/intel-analytics/ipex-llm/assets/77518229/34510b69-2a58-4a44-aa7c-6b86f4d7b927) ![env-check1](https://github.com/intel-analytics/ipex-llm/assets/77518229/c853d078-d1f2-45e4-a6f1-b1c7060444c1)

We cannot reproduce this issue: the memory used by a 2k input is always larger than that of a 1k input. If we use a quantized KV cache, the long sequence's second-token latency might outperform the shorter sequence's....
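To see why KV-cache memory grows with input length, and how quantizing the cache shrinks it, here is a minimal back-of-the-envelope sketch. The model dimensions (`n_layers`, `n_heads`, `head_dim`) are hypothetical placeholders for a 7B-class model, not values taken from these comments:

```python
# Hedged sketch: estimate transformer KV-cache size for a given sequence
# length. All model dimensions below are illustrative assumptions.
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128,
                   bytes_per_elem=2.0):
    # Each layer caches one K and one V tensor of shape
    # (seq_len, n_heads, head_dim), hence the leading factor of 2.
    return 2 * n_layers * seq_len * n_heads * head_dim * bytes_per_elem

fp16_2k = kv_cache_bytes(2048)                      # fp16 cache, 2k input
fp16_1k = kv_cache_bytes(1024)                      # fp16 cache, 1k input
int4_2k = kv_cache_bytes(2048, bytes_per_elem=0.5)  # 4-bit quantized cache

print(f"fp16 2k: {fp16_2k / 2**30:.2f} GiB")
print(f"fp16 1k: {fp16_1k / 2**30:.2f} GiB")
print(f"int4 2k: {int4_2k / 2**30:.2f} GiB")
```

The cache scales linearly with sequence length, so the 2k input always needs twice the cache of the 1k input at the same precision; a 4-bit cache, however, can make the 2k case cheaper than the 1k fp16 case, which also reduces memory traffic during the second-token (decode) step.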

I think the pretrained full model is provided in the repo, but it is not that obvious. You can check this [link](https://drive.google.com/drive/folders/15wx9vOM0euyizq-M1uINgN0_wjVRf9J3)

We cannot reproduce this issue. In our testing, **W4A16** Baichuan2 7B's peak memory grows with the input sequence length when the max output is 512. | | peak mem (GB) |...

> Hello [@FrankLeeeee](https://github.com/FrankLeeeee) , yes, I would like to take this task and I will send out the PR later. Thank you! Hi Zixuan, may I ask what are you...

> Hello [@NovTi](https://github.com/NovTi) , I reviewed your PR, and it shouldn't have any conflicts with yours~ I just did an improvement on the grouped_topk logic. Cooooool