ipex-llm with vLLM fails to run on Core Ultra 7 165H iGPU
Platform: Core Ultra 7 165H iGPU
Model: Qwen/Qwen2-7B-Instruct
Following the steps at https://testbigdldocshane.readthedocs.io/en/perf-docs/doc/LLM/Quickstart/vLLM_quickstart.html#, running python offline_inference.py produces the following error:
(vllm_ipex_env) user@user-Meteor-Lake-Client-Platform:~/vllm$ python offline_inference.py
/home/user/vllm_ipex_env/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/home/user/vllm_ipex_env/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
2024-08-19 13:40:32,185 - INFO - intel_extension_for_pytorch auto imported
WARNING 08-19 13:40:32 config.py:710] Casting torch.bfloat16 to torch.float16.
INFO 08-19 13:40:32 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/home/user/Qwen2-7B-Instruct', tokenizer='/home/user/Qwen2-7B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, seed=0, max_num_batched_tokens=32768, max_num_seqs=256, max_model_len=32768)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 08-19 13:40:32 attention.py:71] flash_attn is not found. Using xformers backend.
2024-08-19 13:40:34,255 - INFO - Converting the current model to sym_int4 format......
2024-08-19 13:40:34,255 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2024-08-19 13:40:38,071 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
INFO 08-19 13:40:40 model_convert.py:249] Loading model weights took 4.5222 GB
LLVM ERROR: Diag: aborted
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) Ultra 7 165H]
Registry and code: 13 MB
Command: python offline_inference.py
Uptime: 19.974422 s
Aborted (core dumped)
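For reference, offline_inference.py is essentially the quickstart example. A rough sketch of what it does is below; the ipex-llm wrapper import path is my best recollection of the quickstart and may differ between ipex-llm versions:

```python
# Rough sketch of the quickstart offline_inference.py (the ipex-llm LLM wrapper
# import path is an assumption based on the quickstart; the other arguments
# mirror the engine config printed in the log above).
from vllm import SamplingParams
from ipex_llm.vllm.engine import IPEXLLMClass as LLM  # assumed import path

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="/home/user/Qwen2-7B-Instruct",  # local model path from the log
    device="xpu",                          # run on the Intel GPU
    dtype="float16",
    enforce_eager=True,
    load_in_low_bit="sym_int4",            # matches the "Converting ... to sym_int4" log line
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```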
I've also tried the whole process on a Data Center GPU Flex (dGPU), which works fine, so I'm wondering if this issue only occurs on the iGPU.
We haven't tested it on the MTL iGPU before. I tried to reproduce it but encountered a different error. Maybe you can try it in Docker according to this Docker guide.
Tried Docker, still got an error. Do you have plans for iGPU support?
root@user-Meteor-Lake-Client-Platform:/llm# python vllm_offline_inference.py
/usr/local/lib/python3.11/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
2024-08-20 12:50:38,388 - INFO - intel_extension_for_pytorch auto imported
WARNING 08-20 12:50:38 config.py:710] Casting torch.bfloat16 to torch.float16.
INFO 08-20 12:50:38 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/llm/Qwen2-7B-Instruct', tokenizer='/llm/Qwen2-7B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, seed=0, max_num_batched_tokens=32768, max_num_seqs=256, max_model_len=32768)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 08-20 12:50:38 attention.py:71] flash_attn is not found. Using xformers backend.
2024-08-20 12:50:46,567 - INFO - Converting the current model to sym_int4 format......
2024-08-20 12:50:46,568 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
[2024-08-20 12:50:46,802] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to xpu (auto detect)
2024-08-20 12:50:51,297 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
INFO 08-20 12:50:53 model_convert.py:249] Loading model weights took 4.5222 GB
error: Traceback (most recent call last):
File "/llm/vllm_offline_inference.py", line 48, in
Hi, I have verified that vLLM works on the iGPU with chatglm3-6b on Linux and does not hit the problem you mentioned in this thread.
The vLLM we provide does have a known problem related to Qwen2-7B-Instruct, but it should not produce the error in your first post.
Can you provide the result of https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/scripts/env-check.sh?
Besides, we will check whether we can fix the problem related to the Qwen2 series models.
Hi, thanks for the reply. The previous environment is no longer available, so I installed the latest release and tried both chatglm3-6b and Qwen2-7B-Instruct; both give the same error message as below.
(vllm_ipex_env) user@user-Meteor-Lake-Client-Platform:~/vllm$ python offline_inference.py
/home/user/vllm_ipex_env/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/home/user/vllm_ipex_env/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
2024-08-22 16:28:32,793 - INFO - intel_extension_for_pytorch auto imported
WARNING 08-22 16:28:32 config.py:710] Casting torch.bfloat16 to torch.float16.
INFO 08-22 16:28:32 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/home/user/Qwen2-7B-Instruct', tokenizer='/home/user/Qwen2-7B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, seed=0, max_num_batched_tokens=32768, max_num_seqs=256, max_model_len=32768)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 08-22 16:28:33 attention.py:71] flash_attn is not found. Using xformers backend.
2024-08-22 16:28:34,887 - INFO - Converting the current model to sym_int4 format......
2024-08-22 16:28:34,888 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2024-08-22 16:28:38,493 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
INFO 08-22 16:28:41 model_convert.py:257] Loading model weights took 4.5222 GB
Traceback (most recent call last):
File "/home/user/vllm/offline_inference.py", line 48, in
Below is the result of running env-check.sh:
(vllm_ipex_env) user@user-Meteor-Lake-Client-Platform:~/ipex-llm/python/llm/scripts$ ./env-check.sh
PYTHON_VERSION=3.10.12
transformers=4.37.0
torch=2.1.0a0+cxx11.abi
ipex-llm Version: 2.1.0b20240821
ipex=2.1.10+xpu
CPU Information:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       46 bits physical, 48 bits virtual
Byte Order:          Little Endian
CPU(s):              22
On-line CPU(s) list: 0-21
Vendor ID:           GenuineIntel
Model name:          Intel(R) Core(TM) Ultra 7 165H
CPU family:          6
Model:               170
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
Stepping:            4
CPU max MHz:         5000.0000
CPU min MHz:         400.0000
BogoMIPS:            6144.00
Total CPU Memory: 62.4902 GB
Operating System: Ubuntu 22.04.4 LTS
Linux user-Meteor-Lake-Client-Platform 6.7.1-060701-generic #202401201133 SMP PREEMPT_DYNAMIC Sat Jan 20 11:43:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
CLI: Version: 1.2.38.20240718 Build ID: 0db09695
Service: Version: 1.2.38.20240718 Build ID: 0db09695
Level Zero Version: 1.16.0
Driver Version       2023.16.12.0.12_195853.xmain-hotfix
Driver Version       2023.16.12.0.12_195853.xmain-hotfix
Driver UUID          32342e32-362e-3330-3034-392e36000000
Driver Version       24.26.30049.6
Driver related package version: ii intel-level-zero-gpu 1.3.30049.6 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
igpu detected
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO [24.26.30049.6]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.30049]
xpu-smi is properly installed.
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Arc(TM) Graphics                                               |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0200-0000-00087d558086                                       |
|           | PCI BDF Address: 0000:00:02.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=16M
00:02.0 VGA compatible controller: Intel Corporation Device 7d55 (rev 08) (prog-if 00 [VGA controller])
DeviceName: To Be Filled by O.E.M.
Subsystem: Intel Corporation Device 2212
Flags: bus master, fast devsel, latency 0, IRQ 214
Memory at 601a000000 (64-bit, prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=256M]
Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
Capabilities:
Hi, please try ipex-llm[xpu]==2.1.0.
A new feature in version 2.1.0b20240821 breaks vLLM.
Also, the 7B model might be too big for the iGPU; Qwen2-1.5B-Instruct might be a better fit.
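After reinstalling, you can quickly confirm which versions are actually being imported (a small sketch using only standard calls; the expected values match your env-check.sh output, except ipex-llm, which should now report 2.1.0):

```python
# Sanity check that the pinned ipex-llm release is the one Python actually imports.
from importlib.metadata import version

import torch
import intel_extension_for_pytorch as ipex

print("ipex-llm:", version("ipex-llm"))  # expect 2.1.0, not 2.1.0b20240821
print("torch   :", torch.__version__)    # e.g. 2.1.0a0+cxx11.abi
print("ipex    :", ipex.__version__)     # e.g. 2.1.10+xpu
```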
Hi, I tried Qwen2-1.5B-Instruct and chatglm3-6b; both worked. Qwen2-7B-Instruct got stuck when loading the model. I've run Qwen2-7B before with ipex-llm (not with vLLM) and it worked fine. Does the size limit only apply to vLLM?
There is no hard size limit in vLLM. Currently I am not sure why Qwen2-7B-Instruct gets stuck; my guess is that it stalls while moving the model from CPU to GPU.
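For a rough sense of why the 7B model is heavier on an iGPU that shares system memory, here is a back-of-the-envelope estimate. The Qwen2-7B config numbers below (layers, KV heads, head dim) are assumptions taken from the public model config, and the actual allocation also depends on vLLM's gpu_memory_utilization setting:

```python
# Back-of-the-envelope memory estimate for Qwen2-7B-Instruct under vLLM on XPU.
# The model-config values (num_layers, num_kv_heads, head_dim) are assumptions
# from the public Qwen2-7B config; real allocation depends on vLLM settings.

weights_gib = 4.5222        # sym_int4 weights, from the "Loading model weights" log line

num_layers    = 28
num_kv_heads  = 4           # GQA: 4 key/value heads
head_dim      = 128
bytes_fp16    = 2
max_model_len = 32768       # from the engine config in the log

# KV cache per token: keys + values across all layers, in fp16.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_fp16
kv_gib_full_context = kv_bytes_per_token * max_model_len / 1024**3

print(f"KV cache per token : {kv_bytes_per_token / 1024:.0f} KiB")     # ~56 KiB
print(f"KV cache @ 32k ctx : {kv_gib_full_context:.2f} GiB/sequence")  # ~1.75 GiB
print(f"Weights (sym_int4) : {weights_gib:.2f} GiB")
```

So on top of the ~4.5 GiB of weights, a full 32k-token sequence needs roughly another 1.75 GiB of fp16 KV cache, all carved out of shared system memory on the iGPU, which is part of why the 1.5B model is much more comfortable there.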