Guancheng Fu
It seems that, under the same conditions, v4.38.0 is correct. I guess something is missing in our `native_sdp`. Update: the `attention_mask` is always None during the calculation. I guess this is...
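For reference, here is a minimal sketch (not the actual `native_sdp` code) of how a scaled-dot-product-attention path commonly handles a missing mask, using PyTorch's `torch.nn.functional.scaled_dot_product_attention`; the shapes are just for illustration:

```python
# Minimal sketch: handling attention_mask=None in an SDP path (illustrative only).
import torch
import torch.nn.functional as F

def sdp_attention(query, key, value, attention_mask=None):
    # When no mask is supplied, fall back to a causal mask so decoding
    # still attends only to previous positions.
    if attention_mask is None:
        return F.scaled_dot_product_attention(query, key, value, is_causal=True)
    # Otherwise pass the explicit (e.g. padding + causal) mask through.
    return F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)

# Example shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)
out = sdp_attention(q, k, v)  # uses the causal fallback
print(out.shape)              # torch.Size([1, 8, 16, 64])
```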
> I tested on transformers v4.38.0; when optimize_model=True, it cannot work because transformers 4.38+ adds a `cache_position` parameter to forward. The error is as below: Traceback (most recent call last):...
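For context, one common workaround (a hypothetical sketch, not the fix used in this repo) is to wrap a forward written for older transformers so that it silently accepts and drops the new `cache_position` keyword:

```python
# Hypothetical sketch: tolerate the `cache_position` kwarg added in transformers 4.38+.
import functools

def make_compatible(old_forward):
    """Wrap a forward written for transformers<4.38 so it accepts
    (and drops) the `cache_position` keyword passed by newer releases."""
    @functools.wraps(old_forward)
    def wrapper(*args, cache_position=None, **kwargs):
        # cache_position is ignored here; the original forward keeps its own KV cache logic.
        return old_forward(*args, **kwargs)
    return wrapper

# Usage (hypothetical): patch a layer's attention forward in place, e.g.
#   attn = model.model.layers[0].self_attn
#   attn.forward = make_compatible(attn.forward)
```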
Closed as completed~ Thanks @jenniew
Hi, I am working to reproduce this issue.
Can you post the result of `offline_inference.py` within your old environment? We recently fixed a bug that may cause the generation to end early. So if the generation ends early...
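If it helps, one way to see whether outputs are being cut off early is to rerun the offline example with explicit sampling limits and check each request's finish reason; a minimal sketch using the vLLM Python API (the model path is a placeholder):

```python
# Minimal sketch for checking whether generation stops early.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is", "Explain what PagedAttention does."]
# finish_reason tells you whether a request stopped at EOS ("stop") or hit the
# token limit ("length"); set ignore_eos=True to force full-length output instead.
sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

llm = LLM(model="/path/to/your/model")  # placeholder path
outputs = llm.generate(prompts, sampling_params)

for out in outputs:
    gen = out.outputs[0]
    print(f"prompt: {out.prompt!r}")
    print(f"generated {len(gen.token_ids)} tokens, finish_reason={gen.finish_reason}")
    print(gen.text)
```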
Can you check if your old environment's vLLM has the following code: https://github.com/analytics-zoo/vllm/blob/sycl_xpu/vllm/worker/model_runner.py#L216 Also, you can try benchmark_throughput to get a more accurate performance estimate. Try following the instructions here:...
The `offline_inference.py` script is not designed for performance benchmarking. If you want end-to-end latency or requests per second, you should start the service according to this...
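As a rough illustration of the client side, once a service is up you can time full round trips yourself; a sketch assuming an OpenAI-compatible completions endpoint on localhost:8000 (the URL, model name, and payload are placeholders to adapt to your deployment):

```python
# Rough client-side sketch for end-to-end latency / requests-per-second measurement.
import time
import requests

URL = "http://localhost:8000/v1/completions"   # assumed endpoint
payload = {
    "model": "your-model-name",                # placeholder
    "prompt": "San Francisco is a",
    "max_tokens": 128,
    "temperature": 0.0,
}

latencies = []
for _ in range(10):
    start = time.perf_counter()
    resp = requests.post(URL, json=payload, timeout=300)
    resp.raise_for_status()
    latencies.append(time.perf_counter() - start)

print(f"mean end-to-end latency: {sum(latencies) / len(latencies):.3f} s")
print(f"rough request throughput: {len(latencies) / sum(latencies):.2f} req/s")
```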
Can you check whether this official benchmark script https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py can be used? If not, can you post the Docker image name and tag so that I can see...
In this case, can you try the following script?

```python
"""Benchmark offline inference throughput."""
import argparse
import json
import random
import time
from typing import List, Optional, Tuple

import torch
...
```
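At its core, a throughput benchmark like this times one batched `generate()` call and divides the token count by wall-clock time; here is a stripped-down sketch of that idea (not the full script above; the model path and lengths are placeholders):

```python
# Stripped-down sketch of an offline throughput measurement with vLLM.
import time
from vllm import LLM, SamplingParams

num_prompts, input_len, output_len = 32, 128, 128

llm = LLM(model="/path/to/your/model")  # placeholder path
# Dummy prompts of roughly input_len words; the real benchmark samples a dataset.
prompts = ["hello " * input_len for _ in range(num_prompts)]
sampling_params = SamplingParams(temperature=0.0, max_tokens=output_len, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

total_out_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{num_prompts / elapsed:.2f} requests/s, "
      f"{total_out_tokens / elapsed:.2f} output tokens/s")
```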
Hi, the vLLM you used is deprecated and will not be supported anymore :cry: The old vLLM does not use PagedAttention and did not perform well enough in our tests....