ipex-llm
Running vLLM service benchmark (4xARC770) with Qwen1.5-32B-Chat model failed
Environment:
Platform: 6548N+4ARC770
Docker Image: intelanalytics/ipex-llm-serving-xpu:2.1.0
Serving script:
Error info:
1. With dtype SYM_INT4 the benchmark succeeds.
2. With dtype FP8 it fails at concurrency >= 4; there is no error at concurrency 1 and 2.
3. GPU card 0 shows N/A utilization, while cards 1, 2, and 3 work well:
4. Serving-side error log:
5. Client error info:
It seems that the gpu-memory-utilization is set too high, causing card 1 to OOM when the first token is computed. You can reduce it to 0.85 and try again.
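As a sketch, lowering the memory fraction looks something like this. The entrypoint module, model path, and other flags here are assumptions based on typical ipex-llm vLLM serving setups, not the poster's actual start script; adapt them to the script used in the container.

```shell
# Hypothetical serving launch illustrating the suggested fix; the entrypoint
# module, model path, and flags are assumptions -- adjust to your start script.
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name Qwen1.5-32B-Chat \
  --model /llm/models/Qwen1.5-32B-Chat \
  --device xpu \
  --load-in-low-bit fp8 \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.85
```

Lowering `--gpu-memory-utilization` from vLLM's default of 0.90 to 0.85 leaves more headroom on each card for the prefill (first-token) activations, which spike with higher concurrency.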
fixed