Add Qwen3-VL multimodal support
Summary
- add the Qwen3-VL multimodal model and loader entry so nano-vllm can run vision-language workloads
- extend engine components (placeholder expansion, vision-cache slicing, KV guard) to mirror vLLM’s multimodal behavior; a rough sketch of the placeholder-expansion idea follows this list
- provide bench_multimodal.py and example_multimodal.py for benchmarking and quick testing
- document how to download Qwen3-VL-2B-Instruct and where to find the multimodal example
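To make the placeholder-expansion piece concrete for reviewers, here is a minimal sketch of the idea only; the token id and function name below are hypothetical, not the identifiers used in this PR:

```python
# Sketch only: expand each image placeholder token into as many slots as the
# vision encoder will emit embeddings for that image, so the scheduler and KV
# cache see the true sequence length. IMAGE_PAD_ID and the function name are
# illustrative, not the actual identifiers from this PR.
from typing import List

IMAGE_PAD_ID = 151655  # hypothetical placeholder token id

def expand_image_placeholders(token_ids: List[int], tokens_per_image: List[int]) -> List[int]:
    expanded: List[int] = []
    image_idx = 0
    for tok in token_ids:
        if tok == IMAGE_PAD_ID:
            # one placeholder in the prompt becomes N vision-token slots
            expanded.extend([IMAGE_PAD_ID] * tokens_per_image[image_idx])
            image_idx += 1
        else:
            expanded.append(tok)
    return expanded

# One image that maps to 4 vision tokens:
# [101, PAD, 102] -> [101, PAD, PAD, PAD, PAD, 102]
print(expand_image_placeholders([101, IMAGE_PAD_ID, 102], [4]))
```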
Benchmark
- GPU: NVIDIA H20 (96GB)
- Command:
CUDA_VISIBLE_DEVICES=0 python3 bench_multimodal.py --model ~/huggingface/Qwen3-VL-2B-Instruct
- Result: 10 requests · 2958 prompt tokens · 2629 generated tokens · 12.49 s latency · 210.55 tok/s throughput
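For context, the throughput figure is output tokens per wall-clock second; a tiny sketch of the arithmetic (the real bench_multimodal.py may report it differently):

```python
# Sketch of the reporting arithmetic only; not the actual bench_multimodal.py.
def report(num_requests: int, prompt_tokens: int, generated_tokens: int, elapsed_s: float) -> None:
    throughput = generated_tokens / elapsed_s  # output tokens per second
    print(f"{num_requests} requests · {prompt_tokens} prompt tokens · "
          f"{generated_tokens} generated tokens · {elapsed_s:.2f} s latency · "
          f"{throughput:.2f} tok/s throughput")

# Numbers from the run above; 2629 / 12.49 ≈ 210.5 tok/s, which matches the
# reported 210.55 up to rounding of the latency.
report(10, 2958, 2629, 12.49)
```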
Testing
python3 example_multimodal.py
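For reference, the example follows nano-vllm's existing LLM / SamplingParams flow; the sketch below is illustrative, and the way images are attached to a prompt (the "images" field and the vision placeholder string) is an assumption, not necessarily the interface in this PR:

```python
# Illustrative sketch, not the actual example_multimodal.py from this PR.
from PIL import Image
from nanovllm import LLM, SamplingParams

llm = LLM("~/huggingface/Qwen3-VL-2B-Instruct", enforce_eager=True)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

image = Image.open("demo.jpg")
# Qwen-VL-style prompts wrap the image placeholder in vision markers; the exact
# template and the "images" field name below are assumptions for illustration.
prompt = "<|vision_start|><|image_pad|><|vision_end|>Describe this image."

outputs = llm.generate([{"prompt": prompt, "images": [image]}], sampling_params)
print(outputs[0]["text"])
```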
Notes
- the diff is large because the feature touches model loading, the scheduler, and caching; happy to walk through the details if needed
- if maintainers feel multimodal support shouldn’t land in core yet, I’m open to discussing an extension repo instead