
Add Qwen3-VL multimodal support

Open 86MaxCao opened this issue 4 weeks ago • 1 comment

Summary

  • add the Qwen3-VL multimodal model and loader entry so nano-vllm can run vision-language workloads
  • extend engine components (placeholder expansion, vision-cache slicing, KV guard) to mirror vLLM’s multimodal behavior; a sketch of the placeholder-expansion idea follows this list
  • provide bench_multimodal.py and example_multimodal.py for benchmarking and quick testing
  • document how to download Qwen3-VL-2B-Instruct and where to find the multimodal example
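For reviewers unfamiliar with the placeholder-expansion step, here is a minimal sketch of the idea, not the PR’s actual code: the function name, the token id, and the one-slot-per-vision-embedding convention are all illustrative assumptions. The point is that the tokenizer emits a compact image placeholder, and the engine expands it so the scheduler and KV cache see the real sequence length.

```python
# Illustrative sketch only -- names, the token id, and the expansion
# convention are assumptions, not the PR's actual implementation.
from typing import List

IMAGE_PLACEHOLDER_ID = 151655  # hypothetical image-pad token id

def expand_placeholders(token_ids: List[int],
                        embeds_per_image: List[int]) -> List[int]:
    """Replace each image placeholder with N copies of itself, where N is
    the number of vision embeddings the encoder produced for that image,
    so KV-cache allocation and scheduling see the true sequence length."""
    out, img_idx = [], 0
    for tok in token_ids:
        if tok == IMAGE_PLACEHOLDER_ID:
            out.extend([IMAGE_PLACEHOLDER_ID] * embeds_per_image[img_idx])
            img_idx += 1
        else:
            out.append(tok)
    return out

# e.g. one image yielding 4 embeddings:
# expand_placeholders([1, IMAGE_PLACEHOLDER_ID, 2], [4])
# -> [1, P, P, P, P, 2]
```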

Benchmark

  • GPU: NVIDIA H20 (96GB)
  • Command: CUDA_VISIBLE_DEVICES=0 python3 bench_multimodal.py --model ~/huggingface/Qwen3-VL-2B-Instruct
  • Result: 10 requests · 2958 prompt tokens · 2629 generated tokens · 12.49 s latency · 210.55 tok/s throughput
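As a sanity check, the throughput figure appears to count generated tokens only (not prompt tokens) over end-to-end latency; with the rounded numbers above the arithmetic comes out to within rounding of the reported value:

```python
# Quick sanity check: throughput = generated tokens / wall-clock latency.
generated_tokens = 2629
latency_s = 12.49  # rounded; the unrounded latency gives the reported 210.55
print(f"{generated_tokens / latency_s:.2f} tok/s")  # -> 210.49
```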

Testing

  • python3 example_multimodal.py
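For anyone who wants a feel for the API before opening the script, here is a sketch of what a call might look like. It assumes the PR mirrors vLLM’s multimodal interface (as the summary states); the dict-style prompt, the multi_modal_data keyword, and the Qwen-style vision tags are assumptions, not necessarily what example_multimodal.py actually does. The model itself can likely be fetched with huggingface-cli download Qwen/Qwen3-VL-2B-Instruct.

```python
# Sketch of a multimodal call, assuming the PR mirrors vLLM's interface.
# The dict prompt shape and multi_modal_data keyword are assumptions.
import os
from PIL import Image
from nanovllm import LLM, SamplingParams

llm = LLM(os.path.expanduser("~/huggingface/Qwen3-VL-2B-Instruct"))
params = SamplingParams(temperature=0.7, max_tokens=256)

image = Image.open("demo.jpg")  # any local test image
prompt = "<|vision_start|><|image_pad|><|vision_end|>Describe this image."

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    params,
)
print(outputs[0]["text"])
```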

Notes

  • the diff is large because the feature touches model loading, the scheduler, and caching; happy to walk through the details if needed
  • if maintainers feel multimodal support shouldn’t land in core yet, I’m open to discussing an extension repo instead

86MaxCao · Nov 11 '25 21:11