ipex-llm
Running vLLM service benchmark (4xARC770) with Qwen1.5-32B-Chat model failed
Environment:
Platform: 6548N+4ARC770
Docker Image: intelanalytics/ipex-llm-serving-xpu:2.1.0
Serving script:
Error info:
1. With dtype SYM_INT4 the benchmark succeeds.
2. With dtype FP8 it fails at concurrency >= 4; there is no error at concurrency 1 and 2.
3. GPU card 0 shows N/A utilization, while cards 1, 2, and 3 work well:
4. Serving-side error log:
5. Client error info:
It seems that the gpu-memory-utilization is set too high, causing card 1 to OOM when the first token is computed. You can reduce it to 0.85 and try again.
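As a sketch, lowering the memory fraction looks something like this. The entrypoint module, model path, and other flags here are assumptions based on typical ipex-llm vLLM serving setups, not the poster's actual start script; adapt them to the script used in the container.

```shell
# Hypothetical serving launch illustrating the suggested fix; the entrypoint
# module, model path, and flags are assumptions -- adjust to your start script.
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name Qwen1.5-32B-Chat \
  --model /llm/models/Qwen1.5-32B-Chat \
  --device xpu \
  --load-in-low-bit fp8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.85
```

Lowering `--gpu-memory-utilization` from vLLM's default of 0.90 to 0.85 leaves more headroom on each card for the prefill (first-token) activations, which spike with higher concurrency.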
fixed