Add Qwen3-VL multimodal support
Summary
- add the Qwen3-VL multimodal model and loader entry so nano-vllm can run vision-language workloads
- extend engine components (placeholder expansion, vision-cache slicing, KV guard) to mirror vLLM’s multimodal behavior; a rough sketch of the placeholder-expansion idea follows this list
- provide bench_multimodal.py and example_multimodal.py for benchmarking and quick testing
- document how to download Qwen3-VL-2B-Instruct and where to find the multimodal example
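To make the placeholder-expansion piece concrete for reviewers, here is a minimal sketch of the idea only; the token id and function name below are hypothetical, not the identifiers used in this PR:

```python
# Sketch only: expand each image placeholder token into as many slots as the
# vision encoder will emit embeddings for that image, so the scheduler and KV
# cache see the true sequence length. IMAGE_PAD_ID and the function name are
# illustrative, not the actual identifiers from this PR.
from typing import List

IMAGE_PAD_ID = 151655  # hypothetical placeholder token id

def expand_image_placeholders(token_ids: List[int], tokens_per_image: List[int]) -> List[int]:
    expanded: List[int] = []
    image_idx = 0
    for tok in token_ids:
        if tok == IMAGE_PAD_ID:
            # one placeholder in the prompt becomes N vision-token slots
            expanded.extend([IMAGE_PAD_ID] * tokens_per_image[image_idx])
            image_idx += 1
        else:
            expanded.append(tok)
    return expanded

# One image that maps to 4 vision tokens:
# [101, PAD, 102] -> [101, PAD, PAD, PAD, PAD, 102]
print(expand_image_placeholders([101, IMAGE_PAD_ID, 102], [4]))
```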
Benchmark
- GPU: NVIDIA H20 (96GB)
- Command:
CUDA_VISIBLE_DEVICES=0 python3 bench_multimodal.py --model ~/huggingface/Qwen3-VL-2B-Instruct
- Result: 10 requests · 2958 prompt tokens · 2629 generated tokens · 12.49 s latency · 210.55 tok/s throughput
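For context, the throughput figure is output tokens per wall-clock second; a tiny sketch of the arithmetic (the real bench_multimodal.py may report it differently):

```python
# Sketch of the reporting arithmetic only; not the actual bench_multimodal.py.
def report(num_requests: int, prompt_tokens: int, generated_tokens: int, elapsed_s: float) -> None:
    throughput = generated_tokens / elapsed_s  # output tokens per second
    print(f"{num_requests} requests · {prompt_tokens} prompt tokens · "
          f"{generated_tokens} generated tokens · {elapsed_s:.2f} s latency · "
          f"{throughput:.2f} tok/s throughput")

# Numbers from the run above; 2629 / 12.49 ≈ 210.5 tok/s, which matches the
# reported 210.55 up to rounding of the latency.
report(10, 2958, 2629, 12.49)
```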
Testing
python3 example_multimodal.py
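For reference, the example follows nano-vllm's existing LLM / SamplingParams flow; the sketch below is illustrative, and the way images are attached to a prompt (the "images" field and the vision placeholder string) is an assumption, not necessarily the interface in this PR:

```python
# Illustrative sketch, not the actual example_multimodal.py from this PR.
from PIL import Image
from nanovllm import LLM, SamplingParams

llm = LLM("~/huggingface/Qwen3-VL-2B-Instruct", enforce_eager=True)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)

image = Image.open("demo.jpg")
# Qwen-VL-style prompts wrap the image placeholder in vision markers; the exact
# template and the "images" field name below are assumptions for illustration.
prompt = "<|vision_start|><|image_pad|><|vision_end|>Describe this image."

outputs = llm.generate([{"prompt": prompt, "images": [image]}], sampling_params)
print(outputs[0]["text"])
```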
Notes
- the diff is large because the feature touches model loading, the scheduler, and caching; happy to walk through the details if needed
- if maintainers feel multimodal support shouldn’t land in core yet, I’m open to discussing an extension repo instead