[OpenVINO] Enable GPU support for OpenVINO vLLM backend
These changes add GPU device support to the OpenVINO vLLM backend:
- Added the `VLLM_OPENVINO_DEVICE` environment variable for OpenVINO device selection (see the usage sketch below)
- Updated GPU-related components in the OpenVINO backend (KV cache shapes, swap capability, model profiling run, etc.)
- Updated the OpenVINO version to 2024.4 RC1 in dependencies
- Updated installation instructions
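A minimal usage sketch, assuming vLLM is built with the OpenVINO backend; the model name, prompt, and sampling settings below are placeholders and not part of this PR:

```python
import os

# Select the OpenVINO device; other OpenVINO device names
# (e.g. "CPU", "GPU.1") are expected to work the same way.
os.environ["VLLM_OPENVINO_DEVICE"] = "GPU"

from vllm import LLM, SamplingParams

# Placeholder model and prompt, just to exercise the GPU path.
llm = LLM(model="Qwen/Qwen2-0.5B-Instruct")
outputs = llm.generate(["OpenVINO on an Intel GPU says:"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```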
Some performance measurements obtained on an Intel ARC A770 (16 GB) for 1000 prompts from the ShareGPT_V3_unfiltered_cleaned_split dataset:
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can do one of these:
- Add the `ready` label to the PR
- Enable auto-merge
🚀
Hi, I tried the PR, but a new error occurred. I used openvino-gpu to run qwen2-0.5b. It turns out:
```
Traceback (most recent call last):
  File "/workspace/vllm/vllm/worker/openvino_worker.py", line 302, in determine_num_available_blocks
    kv_cache_size = self.profile_run()
  File "/workspace/vllm/vllm/worker/openvino_worker.py", line 549, in profile_run
    model_profile_run()
  File "/workspace/vllm/vllm/worker/openvino_worker.py", line 538, in model_profile_run
    self.model_runner.execute_model(seqs,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/vllm/worker/openvino_model_runner.py", line 340, in execute_model
    hidden_states = model_executable(**execute_model_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nncf/torch/dynamic_graph/wrappers.py", line 146, in wrapped
    return module_call(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/vllm/model_executor/model_loader/openvino.py", line 164, in forward
    self.ov_request.wait()
RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:245:
Exception from src/bindings/python/src/pyopenvino/core/infer_request.hpp:54:
Caught exception: Check '!exceed_allocatable_mem_size' failed at src/plugins/intel_gpu/src/runtime/ocl/ocl_engine.cpp:139:
[GPU] Exceeded max size of memory object allocation: requested 19914555392 bytes, but max alloc size supported by device is 1073741824 bytes. Please try to reduce batch size or use lower precision.
```
19914555392 bytes is about 18.5 GB. That's strange. I tried some solutions, but they didn't solve my problem. Any solution or hint?
FYI: I use the GPU version of vLLM to run qwen2-1.5b, which uses ~8 GB according to nvidia-smi.
Changing the memory here to 4 GB solved my problem.
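For anyone else hitting the allocation error above, a rough sketch of the kind of workaround that can help; the model id and sizes below are placeholders, not necessarily the exact change I made:

```python
import os

# Limit the OpenVINO KV cache; as far as I know the value is interpreted in GB.
os.environ["VLLM_OPENVINO_KVCACHE_SPACE"] = "4"
os.environ["VLLM_OPENVINO_DEVICE"] = "GPU"

from vllm import LLM

# A smaller max_model_len can also shrink the buffers allocated during the
# profiling run, which is where the oversized request above was made.
llm = LLM(model="Qwen/Qwen2-0.5B-Instruct", max_model_len=2048)  # placeholder model id
```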
Hi @WoosukKwon, could you please take a look at these changes?
@mgoin, thank you for your comments! I applied them and rebased the branch on top of the latest main. Please take a look.
Thanks for the quick feedback. I am enabling the full CI run; you may want to merge with the latest main to get the CI green, since there have been failures before today.
@mgoin, @WoosukKwon, could you activate auto-merge?