Sergey Shlyapnikov
Hi @liuxingbin, can you share how you are running vLLM? Did you try setting a lower max_model_length value? We assume there is enough GPU memory to run max_model_length tokens...
Hi @awayzjj, thank you for checking the issue! Let me add more details. The issue is related to an incorrect performance profiling report for the IF operation. It is...
Hi @JulienMaille, Could you please share the installed GPU driver version? Also, could you please check if the issue can be reproduced using [benchmark_app](https://docs.openvino.ai/nightly/get-started/learn-openvino/openvino-samples/benchmark-tool.html#examples-of-running-the-tool) tool?
By the way, the current version handles dynamism by recompiling the kernel for each new dynamic shape configuration. However, we could support a shape-agnostic kernel version that is compiled once...
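To illustrate the trade-off described above, here is a minimal, hypothetical sketch (not actual OpenVINO GPU plugin code; the class and method names are made up for illustration) contrasting per-shape kernel recompilation with a shape-agnostic kernel compiled once:

```python
class KernelCache:
    """Toy model of a GPU plugin's compiled-kernel cache."""

    def __init__(self):
        self.compile_count = 0  # how many times we paid the compilation cost
        self._cache = {}

    def get_shape_specific(self, shape):
        # Current approach: one compilation per distinct dynamic
        # shape configuration encountered at runtime.
        if shape not in self._cache:
            self.compile_count += 1  # compile a kernel specialized for `shape`
            self._cache[shape] = f"kernel_for_{shape}"
        return self._cache[shape]

    def get_shape_agnostic(self):
        # Alternative approach: compile once; actual shapes are passed
        # to the kernel as runtime arguments instead of being baked in.
        if "agnostic" not in self._cache:
            self.compile_count += 1
            self._cache["agnostic"] = "shape_agnostic_kernel"
        return self._cache["agnostic"]
```

With the shape-specific path, every new shape configuration triggers another compilation; the shape-agnostic path compiles once regardless of how many shapes are seen, trading some per-shape specialization for a stable first-inference latency.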
@xipingyan, can you please check the CI test failures?
```
ov_gpu_func_tests-0 INFO: FAILED TESTS (1/39269):
ov_gpu_func_tests-0 INFO: 2909 ms: ov_gpu_func_tests smoke_CustomOpDynamic.Accuracy
```
@AKochin , @dmitry-gorokhov, could you please review the changes from Transformations and CPU sides?
Hi @WoosukKwon, could you please take a look at these changes?
@mgoin, thank you for your comments! I [applied them](https://github.com/vllm-project/vllm/pull/8192/commits/1723d77e7352d7138b14d1427cc16f1987ef5761) and rebased the branch on top of the recent main, please take a look
@Kotomi-Du, how about the following implementation?
1) Keep the existing order of allocations and memory reuse for the sum post-op
2) Move the logic related to onednn impls node memory...