[vLLM backend] Multimodal support for OpenAI-Compatible frontend
I'm currently running the Qwen2.5-VL model in a single-node, single-process setup with 8 H20 GPUs on one machine. I want to deploy the model on Triton with vLLM as the backend, loading one model instance per GPU. The client sends HTTP requests in the OpenAI API format, so I also need Triton's OpenAI-compatible frontend to accept multimodal input, since I'm working on autonomous-driving inference and the requests include image data. So far, however, I've only managed to get text-only inference working; multimodal input does not appear to be supported yet.
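For reference, this is a minimal sketch of the kind of request the client sends, following the standard OpenAI chat-completions multimodal message format. The base_url/port, API key, model name, and image path are placeholders for my setup, not values taken from the Triton docs:

```python
# Sketch of the multimodal request I want the OpenAI-compatible frontend to accept.
# base_url, api_key, model name, and image path are assumptions for illustration.
import base64

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9000/v1",  # assumed Triton OpenAI-compatible endpoint
    api_key="EMPTY",                      # no auth in a local deployment
)

# Encode a camera frame as a base64 data URL (typical for local image data).
with open("front_camera.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen2.5-vl",  # assumed name under which the model is served
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the driving scene in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Text-only requests in this format already work for me; it is the `image_url` content parts that the frontend does not seem to handle.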
I hope the developers can look into this and respond.