openvino_notebooks
Unable to run qwen2-vl inference on Intel integrated GPU. Works fine with CPU
Hi,
I'm unable to run inference on the Intel integrated GPU with the qwen2-vl-2B model. It works fine if I select CPU as the device.
The exception reports error code "-5".
The stack trace:
File "ov_qwen2_vl.py", line 763, in forward self.request.wait() RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:245: Exception from src/bindings/python/src/pyopenvino/core/infer_request.hpp:54: Caught exception: Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_stream.cpp:365: [GPU] clFlush, error code: -5
System details:
OS:
Operating System: Ubuntu 22.04.5 LTS
Kernel: Linux 6.8.0-52-generic
GPU: Intel Corporation HD Graphics 630 (VGA compatible controller)
```python
import openvino as ov

core = ov.Core()
print(core.available_devices)
```
['CPU', 'GPU']
Code snapshot
```python
from pathlib import Path

import requests

from ov_qwen2_vl import model_selector

model_id = model_selector()
print(f"Selected {model_id.value}")
pt_model_id = model_id.value
model_dir = Path(pt_model_id.split("/")[-1])

from ov_qwen2_vl import convert_qwen2vl_model

# Uncomment the line below to see the model conversion code
# convert_qwen2vl_model??

import nncf

# INT4 asymmetric weight compression applied during conversion
compression_configuration = {
    "mode": nncf.CompressWeightsMode.INT4_ASYM,
    "group_size": 128,
    "ratio": 1.0,
}
convert_qwen2vl_model(pt_model_id, model_dir, compression_configuration)

from ov_qwen2_vl import OVQwen2VLModel

# Uncomment the line below to see the model inference class code
# OVQwen2VLModel??

from notebook_utils import device_widget

device = device_widget(default="AUTO", exclude=["NPU"])
model = OVQwen2VLModel(model_dir, device.value)
print(device.value)

from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, TextStreamer
from qwen_vl_utils import process_vision_info

# Bound the number of visual tokens produced per image
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)

if processor.chat_template is None:
    tok = AutoTokenizer.from_pretrained(model_dir)
    processor.chat_template = tok.chat_template

example_image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
example_image_path = Path("demo.jpeg")

if not example_image_path.exists():
    Image.open(requests.get(example_image_url, stream=True).raw).save(example_image_path)

image = Image.open(example_image_path)
question = "Describe this image."

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": f"file://{example_image_path}",
            },
            {"type": "text", "text": question},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# display(image)
print("Question:")
print(question)
print("Answer:")

generated_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    streamer=TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True),
)
```
An internet search for OpenCL error codes (for instance https://streamhpc.com/blog/2013-04-28/opencl-error-codes/) indicates that "-5" could mean:
-5 | CL_OUT_OF_RESOURCES | failure to allocate resources required by the OpenCL implementation on the device
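Since CL_OUT_OF_RESOURCES usually points at memory pressure on the device, it may help to check what the GPU plugin reports for the device. A minimal sketch, assuming a recent OpenVINO release where the GPU plugin exposes these property names (they may differ between versions):

```python
import openvino as ov

core = ov.Core()
# Report the device name and the memory the GPU plugin believes is available.
# HD Graphics 630 has no dedicated VRAM, so this reflects shared system memory.
print(core.get_property("GPU", "FULL_DEVICE_NAME"))
print(core.get_property("GPU", "GPU_DEVICE_TOTAL_MEM_SIZE"))
```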
Could you provide more information about your system, please? Which exact CPU, what amount of total system memory? When you started the code, could you monitor the system memory usage, please? Do you see if the amount of used system memory wants to consume the total system memory?
I monitored memory usage with both the CPU and GPU options. Here is a screenshot of htop; since the integrated HD Graphics 630 doesn't have its own memory and uses system memory, I used htop to monitor system memory.
Inference on GPU: system memory and swap look to be hitting their limits.
Inference on CPU: system memory and swap look fine.
So I'm not sure why memory usage increases for GPU inference (with the same model). Looks like a bug.
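For a finer-grained view than htop screenshots, memory can also be logged programmatically around the generate call. A minimal sketch, assuming the third-party psutil package is installed (the background thread and one-second interval are illustrative, not part of the original notebook):

```python
import threading
import time

import psutil

def log_memory(stop_event, interval=1.0):
    # Periodically print system and process memory while inference runs.
    proc = psutil.Process()
    while not stop_event.is_set():
        vm = psutil.virtual_memory()
        print(f"system used: {vm.used / 2**30:.2f} GiB / {vm.total / 2**30:.2f} GiB, "
              f"process RSS: {proc.memory_info().rss / 2**30:.2f} GiB")
        time.sleep(interval)

stop = threading.Event()
t = threading.Thread(target=log_memory, args=(stop,), daemon=True)
t.start()
try:
    generated_ids = model.generate(**inputs, max_new_tokens=100)
finally:
    stop.set()
    t.join()
```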
Also, when I ran GPU inference multiple times, it failed with a different error a couple of times. Here is the stack trace:

```
res = self.image_embed_merger([hidden_states, causal_mask, rotary_pos_emb])[0]
  File "/usr/local/lib/python3.10/dist-packages/openvino/_ov_api.py", line 427, in __call__
    return self._infer_request.infer(
  File "/usr/local/lib/python3.10/dist-packages/openvino/_ov_api.py", line 171, in infer
    return OVDict(super().infer(_data_dispatch(
RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:223:
Exception from src/plugins/intel_gpu/src/runtime/ocl/ocl_event.cpp:56:
[GPU] clWaitForEvents, error code: -14
```
Here are the CPU and memory details of my system:

```
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           39 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  4
On-line CPU(s) list:     0-3
Vendor ID:               GenuineIntel
Model name:              Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz
CPU family:              6
Model:                   158
Thread(s) per core:      1
Core(s) per socket:      4
Socket(s):               1
Stepping:                9
CPU max MHz:             3300.0000
CPU min MHz:             800.0000
BogoMIPS:                5399.81
```
Memory:
```
RANGE                                  SIZE  STATE   REMOVABLE  BLOCK
0x0000000000000000-0x00000000dfffffff  3.5G  online  yes        0-27
0x0000000100000000-0x000000021fffffff  4.5G  online  yes        32-67

Memory block size:       128M
Total online memory:       8G
Total offline memory:      0B
```
Do you run the code (code snapshot shown above) as a Python script or as a Jupyter notebook (in the browser)? Have you changed anything in the code if it is based on a Jupyter notebook from this repo? Did you use https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/qwen2-vl.ipynb or https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-audio/qwen2-audio.ipynb as a base?
Can you provide more details about the compression and conversion settings, please?
To process the model on the GPU, it is first compiled into OpenCL kernels (including optimizations, which may introduce different operations where needed) - so after OpenCL compilation the model can behave differently and consume a different amount of memory than the model used by the CPU plugin.
On GPU the model is typically executed in FP16 precision, whereas on CPU, depending on the model and hardware generation, INT8, INT4, FP16 or FP32 "just works". You might need to delete the local models and run conversion again with FP16 for the GPU - and in parallel you could convert and compress the model in additional precisions for the CPU; a sketch of what that could look like is shown below.
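As a rough illustration only - this assumes that convert_qwen2vl_model from ov_qwen2_vl.py accepts None to skip weight compression (please check the helper; if it does not, an INT8 configuration is a less aggressive alternative to INT4):

```python
from pathlib import Path

import nncf

from ov_qwen2_vl import convert_qwen2vl_model

pt_model_id = "Qwen/Qwen2-VL-2B-Instruct"

# Variant 1 (assumption): keep the converted IR in FP16 by skipping weight compression.
convert_qwen2vl_model(pt_model_id, Path("Qwen2-VL-2B-Instruct-fp16"), None)

# Variant 2: INT8 asymmetric weight compression as a middle ground between FP16 and INT4.
int8_configuration = {"mode": nncf.CompressWeightsMode.INT8_ASYM}
convert_qwen2vl_model(pt_model_id, Path("Qwen2-VL-2B-Instruct-int8"), int8_configuration)
```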
What is the memory consumption before starting the code? Could you try closing other applications to free memory and try again?
Do the errors/exceptions occur only during inference, or also during conversion and compression?
Thank you for your detailed answer. I used the methods in https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/ov_qwen2_vl.py for compression and conversion of the models for GPU/CPU, i.e.:

```python
compression_configuration = {
    "mode": nncf.CompressWeightsMode.INT4_ASYM,
    "group_size": 128,
    "ratio": 1.0,
}
convert_qwen2vl_model(pt_model_id, model_dir, compression_configuration)
# ...
device = device_widget(default="AUTO", exclude=["NPU"])
model = OVQwen2VLModel(model_dir, device.value)
```

You can look at the code snapshot I sent earlier for the detailed code.
I ran this in a Python script, not via a Jupyter notebook. With the notebook you linked, I ran into issues: the optimum_cli step hung on my system (I didn't debug this further).
Are you suggesting that I'm using the wrong code? I'm new to LLM code, so any help would be great!
NOTE: I had stopped all other applications and confirmed that CPU and memory usage were pretty low before starting my script.
I'm not sure whether I have provided all the information you need. Let me know if you need more clarification :-)
The error happens only during inference.
Which of the two links have you tried, visual-language or audio?
You probably haven't seen optimum_cli hang; it was most likely just taking a very long time. Conversion, quantization and compression can take very, very long, especially when the process runs short on system memory and starts swapping to HDD/SSD.
For GPU, try using FP16 or FP32 instead of INT4 or INT8.
visual-language.
Should I change the current code (snapshot attached) to FP16, or do you want me to try the optimum_cli code?
Thanks
Now using a new environment (deleted my previous Python virtual environment), I freshly synchronized openvino_notebooks, reinstalled the requirements, and started the Jupyter notebook https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/qwen2-vl.ipynb. Using a laptop, MS-Win11, Python 3.12.4, and the default values in the drop-down fields ("Qwen/Qwen2-VL-2B-Instruct"), including INT4. Downloading the model and 4-bit compression took really long, and system memory consumption was really high. I changed the inference device from AUTO to GPU. Using the "cat.png" example I successfully get this result: "In the image, there is a cat lying inside a cardboard box. The cat has a fluffy coat and is lying on its back with its paws up. The box is placed on a light-colored carpet, and the background shows a portion of a white couch and a window with curtains. The lighting in the room is bright, suggesting it is daytime. The cat appears to be relaxed and comfortable in the box."
(GPU load was monitored by selecting "Computer" in the MS-Win Task Manager.)
(If you use e.g. intel_gpu_top under Linux to measure GPU load, you would need to look at the "Render/3D" section, which shows the execution-unit load from the OpenCL kernels doing the inference.)
Do you have a chance to test the original Jupyter notebook as a consistency check?
At some point the OpenVINO dev team needs to jump in - they might ask you for lower-level driver information such as the OpenCL version, in case your SoC with integrated iGPU needs a specific one. Your SoC (Intel(R) Core(TM) i5-7500T) is an older model, whereas I used a newer one (Intel Core Ultra 7 155H).
Good to see it's working on MS-Win11. I'm running this on Ubuntu 22.04.
I tried https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/qwen2-vl.ipynb as a Python script on a 16GB Ubuntu system and it failed with memory issues. Looks like I need to run it on a server with more memory.
If there is enough HDD/SSD storage, the operating system should start swapping when it runs short on memory (provided there is something to swap other than the memory needed for conversion/compression/inference). It could also be a driver issue.
Have you tried running other notebooks/scripts that do inference on the GPU, just to check whether the environment and drivers are consistent? See the sketch below for a minimal check.
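As an illustration, a minimal sanity check could compile a trivial model on the GPU plugin and run a single inference, independent of the Qwen2-VL pipeline (a sketch assuming the Python opset13 API; adjust the opset import to your OpenVINO version):

```python
import numpy as np
import openvino as ov
from openvino.runtime import opset13 as ops

# Build a tiny model (ReLU on a 1x4 tensor) purely as a driver/runtime check.
x = ops.parameter([1, 4], dtype=np.float32, name="x")
model = ov.Model([ops.relu(x)], [x], "gpu_sanity_check")

core = ov.Core()
compiled = core.compile_model(model, "GPU")

result = compiled(np.array([[-1.0, 0.0, 2.5, -3.0]], dtype=np.float32))
print(result[0])  # expected: [[0.  0.  2.5 0. ]]
```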
I think I'm facing 2 problems, with 2 different codes for inferencing qwen2-vl models with OpenVINO.
- First, the code that I attached at the beginning of this ticket, which uses the APIs in ov_qwen2_vl.py. a. The initial issue was the "CL resource error -5" on an 8GB RAM system (during inference). This happened only with the GPU device; inference on CPU worked fine. b. On a 16GB system, I don't see memory as an issue when I run this code, but the script hangs/gets stuck at model.generate(**inputs, max_new_tokens=128) forever.
- Second, the code you pointed to, https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/qwen2-vl.ipynb. a. The model compression code fails due to low memory (on both the 8GB and 16GB systems).
So, the first answer I'm looking for is: which is the right code to pursue further?
The second answer I'm looking for is: has anyone tried any of the above code on Ubuntu 22.04 running on 16GB servers?
The third answer I'm looking for is: what is the recommended hardware configuration for running this on Ubuntu 22.04 servers?
I want to thank you for all the clarifications I have received so far. Hopefully the answers to the above will help me to progress further and debug this issue to closure. Thanks!
> I think I'm facing 2 problems, with 2 different codes for inferencing qwen2-vl models with OpenVINO.
> First, the code that I attached at the beginning of this ticket, which uses the APIs in ov_qwen2_vl.py. a. The initial issue was the "CL resource error -5" on an 8GB RAM system (during inference). This happened only with the GPU device; inference on CPU worked fine.
This is difficult to comment on. It could be too little memory, a driver-version conflict, an SoC that is too old, or the source code of the script (which behaves differently from the Jupyter notebook you mention under 2a).
> b. On a 16GB system, I don't see memory as an issue when I run this code, but the script hangs/gets stuck at model.generate(**inputs, max_new_tokens=128) forever.
It could "normally" take very, very, really very long - even several minutes.
The Jupyter notebook uses config.max_new_tokens = 100, can you try with smaller values than 128 (try something extreme like 5, or 10, 50, 100)?
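For example, a quick (illustrative) way to see whether generation is progressing at all is to time a few small token budgets; the variable names reuse the ones from the code snapshot above:

```python
import time

# Time generation for a few increasing token budgets to see whether the
# model produces output at all, and roughly how long each token takes.
for budget in (5, 10, 50, 100):
    start = time.perf_counter()
    generated_ids = model.generate(**inputs, max_new_tokens=budget)
    elapsed = time.perf_counter() - start
    print(f"max_new_tokens={budget}: {elapsed:.1f} s "
          f"({elapsed / budget:.2f} s/token upper bound)")
```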
> Second, the code you pointed to, https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/qwen2-vl/qwen2-vl.ipynb. a. The model compression code fails due to low memory (on both the 8GB and 16GB systems).
Which error messages do you get? Do they occur during the "compression" step? Could you skip the compression and only convert to FP16/FP32?
> So, the first answer I'm looking for is: which is the right code to pursue further?
> The second answer I'm looking for is: has anyone tried any of the above code on Ubuntu 22.04 running on 16GB servers?
Could you find a bigger machine, convert, compress and quantize the models there, and copy them over to your resource-constrained machine?
> The third answer I'm looking for is: what is the recommended hardware configuration for running this on Ubuntu 22.04 servers?
> I want to thank you for all the clarifications I have received so far. Hopefully the answers to the above will help me to progress further and debug this issue to closure. Thanks!