alexsin368
@chsasank I installed IPEX 2.1.40+xpu with Python 3.11.9, same as you, but I am only able to reproduce one of the two issues you see. I'm getting 16.72 TFLOPS and...
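For reference, that kind of number comes from timing a large half-precision matmul on the XPU device. A minimal sketch of such a measurement is below; the matrix size, dtype, and iteration count are my own choices for illustration, not necessarily what your script uses.

```python
import time
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device

# Assumed benchmark parameters; the original report may use different ones.
N = 4096
iters = 100
a = torch.randn(N, N, dtype=torch.float16, device="xpu")
b = torch.randn(N, N, dtype=torch.float16, device="xpu")

# Warm up so kernel compilation is not included in the timing.
for _ in range(10):
    torch.matmul(a, b)
torch.xpu.synchronize()

start = time.time()
for _ in range(iters):
    torch.matmul(a, b)
torch.xpu.synchronize()
elapsed = time.time() - start

# A square GEMM performs 2 * N^3 floating-point operations.
tflops = 2 * N**3 * iters / elapsed / 1e12
print(f"{tflops:.2f} TFLOPS")
```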
@chsasank The performance regression needs to be within the scope of IPEX itself for my team and me to continue debugging. Let's figure out whether the regression is indeed to...
Hi @yash3056, please describe your issue in detail and provide the code and steps to reproduce it.
@LeptonWu this issue could be related to https://github.com/intel/intel-extension-for-pytorch/issues/529 and my team members are looking into it.
@Pradeepa99 The release notes mention added support for the AWQ format, and it seems this refers to the usage of ipex.llm.optimize, where you can specify the quant_method as 'gptq'...
@Pradeepa99 Yes, the test case example you found is what I meant. IPEX does not have an example similar to the GPTQ one you found. We recommend you use Intel...
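If it helps, here is a minimal sketch of what I mean by the ipex.llm.optimize path with a pre-quantized checkpoint. The model ID and checkpoint path are placeholders, and keyword names such as quantization_config and low_precision_checkpoint may differ between IPEX releases, so please treat this as a sketch rather than the exact testcase.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative placeholders only -- not the model/checkpoint from this thread.
model_id = "meta-llama/Llama-2-7b-hf"
gptq_checkpoint_path = "saved_results/gptq_checkpoint.pt"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Weight-only quantization config; options such as weight dtype and group size
# vary by IPEX release, so defaults are used here.
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping()

# The pre-quantized (GPTQ-format) weights are handed over via
# low_precision_checkpoint; the exact expected format (plain state dict vs. a
# tuple that also carries a config with quant_method) depends on the version.
low_precision_checkpoint = torch.load(gptq_checkpoint_path)

model = ipex.llm.optimize(
    model,
    dtype=torch.float32,
    quantization_config=qconfig,
    low_precision_checkpoint=low_precision_checkpoint,
    inplace=True,
)
```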
@andyluo7 I will work on reproducing this issue and get back to you with my findings. Did you try passing in THUDM/chatglm2-6b directly as the model?
Issue reproduced. What version of transformers are you using? I have 4.37.0. I will be working with the team to resolve your issue.
@andyluo7 I found what's causing the issue: it happens when you pass in --token-latency as an input argument. Take a look at lines 211 and 215: https://github.com/intel/intel-extension-for-pytorch/blob/main/examples/cpu/inference/python/llm/single_instance/run_generation.py#L211-L215 For now, try...
@andyluo7 As a workaround for now, modify line 211 in the run_generation.py script to `gen_ids = output` and it should work.
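To make the decode path concrete, here is a small self-contained sketch of what that part of the script does and where the workaround goes. gpt2 and the prompt are placeholders, not the model from this issue, and the snippet paraphrases the script rather than copying it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and prompt for illustration only.
model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=16)

# run_generation.py (around line 211) does roughly:
#   gen_ids = output[0] if args.token_latency else output
# With --token-latency the script expects a (ids, latency) tuple, which is
# where the reported failure comes from. The workaround is the plain form:
gen_ids = output

gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
print(gen_text)
```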