[vLLM backend] Multimodal support for OpenAI-Compatible frontend
I'm currently running the Qwen2.5-VL model in a single-node, single-process setup with 8 H20 GPUs on one machine. I want to deploy the model on Triton with vLLM as the backend, loading one model instance per GPU. The client sends HTTP requests in the OpenAI API format, so I also need Triton's OpenAI-compatible frontend to accept multimodal input, since I'm working on autonomous-driving inference and the requests include image data. So far, however, I've only managed to get text-only inference working; multimodal input does not appear to be supported yet.
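For reference, this is a minimal sketch of the kind of request the client sends, following the standard OpenAI chat-completions multimodal message format. The base_url/port, API key, model name, and image path are placeholders for my setup, not values taken from the Triton docs:

```python
# Sketch of the multimodal request I want the OpenAI-compatible frontend to accept.
# base_url, api_key, model name, and image path are assumptions for illustration.
import base64

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9000/v1",  # assumed Triton OpenAI-compatible endpoint
    api_key="EMPTY",                      # no auth in a local deployment
)

# Encode a camera frame as a base64 data URL (typical for local image data).
with open("front_camera.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen2.5-vl",  # assumed name under which the model is served
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the driving scene in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

Text-only requests in this format already work for me; it is the `image_url` content parts that the frontend does not seem to handle.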
I hope the developers can look into this and respond.