[OpenVINO] Enable GPU support for OpenVINO vLLM backend
These changes add GPU device support to the OpenVINO vLLM backend:
- Added the `VLLM_OPENVINO_DEVICE` environment variable for OpenVINO device selection (see the usage sketch below)
- Updated GPU-related components in the OpenVINO backend (KV cache shapes, swap capability, model profiling run, etc.)
- Updated the OpenVINO version to 2024.4 RC1 in dependencies
- Updated installation instructions
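A minimal usage sketch, assuming vLLM is built with the OpenVINO backend; the model name, prompt, and sampling settings below are placeholders and not part of this PR:

```python
import os

# Select the OpenVINO device; other OpenVINO device names
# (e.g. "CPU", "GPU.1") are expected to work the same way.
os.environ["VLLM_OPENVINO_DEVICE"] = "GPU"

from vllm import LLM, SamplingParams

# Placeholder model and prompt, just to exercise the GPU path.
llm = LLM(model="Qwen/Qwen2-0.5B-Instruct")
outputs = llm.generate(["OpenVINO on an Intel GPU says:"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```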
Some performance measurements obtained on an Intel ARC A770 (16 GB) for 1000 prompts from the ShareGPT_V3_unfiltered_cleaned_split dataset:
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can do one of these:
- Add the `ready` label to the PR
- Enable auto-merge
🚀
Hi, I tried the PR, but a new error occurred. I used openvino-gpu to run qwen2-0.5b. It turns out:
```
Traceback (most recent call last):
  File "/workspace/vllm/vllm/worker/openvino_worker.py", line 302, in determine_num_available_blocks
    kv_cache_size = self.profile_run()
  File "/workspace/vllm/vllm/worker/openvino_worker.py", line 549, in profile_run
    model_profile_run()
  File "/workspace/vllm/vllm/worker/openvino_worker.py", line 538, in model_profile_run
    self.model_runner.execute_model(seqs,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/vllm/worker/openvino_model_runner.py", line 340, in execute_model
    hidden_states = model_executable(**execute_model_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nncf/torch/dynamic_graph/wrappers.py", line 146, in wrapped
    return module_call(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/vllm/model_executor/model_loader/openvino.py", line 164, in forward
    self.ov_request.wait()
RuntimeError: Exception from src/inference/src/cpp/infer_request.cpp:245:
Exception from src/bindings/python/src/pyopenvino/core/infer_request.hpp:54:
Caught exception: Check '!exceed_allocatable_mem_size' failed at src/plugins/intel_gpu/src/runtime/ocl/ocl_engine.cpp:139:
[GPU] Exceeded max size of memory object allocation: requested 19914555392 bytes, but max alloc size supported by device is 1073741824 bytes. Please try to reduce batch size or use lower precision.
```
19914555392 bytes is about 18.5 GB. That's strange. I tried some solutions, but they didn't solve my problem. Any solution or hint?
FYI: I use the GPU version of vLLM to run qwen2-1.5b, which uses ~8 GB according to nvidia-smi.
Changing the memory here to 4 GB solved my problem.
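For anyone else hitting the allocation error above, a rough sketch of the kind of workaround that can help; the model id and sizes below are placeholders, not necessarily the exact change I made:

```python
import os

# Limit the OpenVINO KV cache; as far as I know the value is interpreted in GB.
os.environ["VLLM_OPENVINO_KVCACHE_SPACE"] = "4"
os.environ["VLLM_OPENVINO_DEVICE"] = "GPU"

from vllm import LLM

# A smaller max_model_len can also shrink the buffers allocated during the
# profiling run, which is where the oversized request above was made.
llm = LLM(model="Qwen/Qwen2-0.5B-Instruct", max_model_len=2048)  # placeholder model id
```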
Hi @WoosukKwon, could you please take a look at these changes?
@mgoin, thank you for your comments! I applied them and rebased the branch on top of the latest main. Please take a look.
Thanks for the quick feedback. I am enabling the full CI run; you may want to merge with the latest main to get the CI green, since there have been failures before today.
@mgoin, @WoosukKwon, could you activate auto-merge?