Kai Huang

136 comments by Kai Huang

Fixed in: https://github.com/intel-analytics/ipex-llm/pull/10566

Hi, on https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary it seems the reported memory is for 1 token in and 2048/8192 tokens out. We will reproduce this result and update our results here.

For longer sequence inputs (1k/2k/4k, ...), bigdl-llm uses more memory than the official model. We will look into this; a sketch of how we measure this follows.
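
For reference, here is a minimal sketch of how such a peak-memory measurement could be scripted. It assumes ipex-llm's transformers-style loading API and IPEX's `torch.xpu` memory-statistics helpers (which mirror the CUDA ones); the model path, input lengths, and output length are placeholders, not our actual benchmark harness:

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401, registers the xpu backend
from ipex_llm.transformers import AutoModelForCausalLM

MODEL_PATH = "Qwen/Qwen-1_8B-Chat"  # placeholder; point to your local checkpoint

# Load with int4 weight quantization and move to the Intel GPU
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, load_in_4bit=True, trust_remote_code=True
).to("xpu")

for in_len in (1, 1024, 2048, 4096):
    # Synthetic input ids are enough for a memory measurement
    input_ids = torch.ones(1, in_len, dtype=torch.long).to("xpu")
    torch.xpu.reset_peak_memory_stats()
    with torch.inference_mode():
        model.generate(input_ids, max_new_tokens=2048)
    torch.xpu.synchronize()
    peak_gb = torch.xpu.max_memory_allocated() / 1024**3
    print(f"input={in_len:5d} tokens -> peak XPU memory {peak_gb:.2f} GB")
```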

Hi, sorry for the late reply. One difference is that the official int4 model uses w4a16, but previously, when you ran with ipex-llm, we were using w4a32,...
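
To make the w4a16 vs. w4a32 distinction concrete: in both cases the weights are quantized to 4-bit (w4); what differs is the activation precision (fp16 vs. fp32). A minimal sketch, assuming ipex-llm's transformers-style API, where casting the remaining float modules with `.half()` is how fp16 activations are typically obtained in the examples (load only one model at a time in practice):

```python
from ipex_llm.transformers import AutoModelForCausalLM

MODEL_PATH = "Qwen/Qwen-7B-Chat"  # placeholder path

# w4a32: int4 weights, fp32 activations (the earlier default behavior)
model_a32 = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, load_in_4bit=True, trust_remote_code=True
).to("xpu")

# w4a16: int4 weights, fp16 activations, matching the official int4 model
model_a16 = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, load_in_4bit=True, trust_remote_code=True
).half().to("xpu")
```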

Hi @sriraman2020 Sorry, Mixtral multi-GPU inference is not currently supported. We will update this issue if it becomes supported in the future.

Hi @AmberXu98 Does the error `Native API returns: -5 (PI_ERROR_OUT_OF_RESOURCES)` you encountered mean you ran out of memory? Also, could you provide more information about your settings so that we can better...

Hi @AmberXu98 We are reproducing with your prompt again to double-confirm it in our environment. By the way, we want to confirm: is it the case that if you input...


@leonardozcm Please take a look.

[chat.txt](https://github.com/intel-analytics/BigDL/files/13948810/chat.txt) This is our code to test Qwen with context. For the 14B model, it seems we can currently only run one round of chat. Customer code: https://github.com/xusenlinzy/api-for-open-llm/tree/master
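
For anyone reproducing this, below is a minimal sketch of a multi-round chat test with accumulated context. It assumes the Qwen-Chat remote-code `model.chat(tokenizer, query, history=...)` interface; the model path and prompts are placeholders, not the contents of chat.txt:

```python
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

MODEL_PATH = "Qwen/Qwen-14B-Chat"  # placeholder path

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, load_in_4bit=True, trust_remote_code=True
).to("xpu")

history = None
for round_no, query in enumerate(
    ["Introduce yourself.", "Summarize your previous answer in one sentence."], 1
):
    # Each call feeds the full history back in, so memory use grows per round,
    # which is where the 14B model currently fails after round one
    response, history = model.chat(tokenizer, query, history=history)
    print(f"[round {round_no}] {response}")
```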