Add qwen3 vl
Summary of Changes
Hello @SangChengC, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly expands the model capabilities by adding full support for the Qwen3-VL and Qwen3-VL MoE models. It involves a deep integration of multimodal features, including a new visual encoder, specialized tokenization, and the innovative 'deepstack' mechanism to pass rich visual context through multiple layers. Alongside these major additions, the PR refines existing image processing, enhances Rotary Positional Embeddings for complex scenarios, and optimizes MoE weight handling and Whisper's attention for better performance.
Highlights
- Qwen3-VL Model Integration: Introduced comprehensive support for the Qwen3-VL (Vision-Language) model, including its Mixture-of-Experts (MoE) variant, by adding new model definitions, inference logic, and weight handling.
- Deepstack Feature Support: Integrated 'deepstack' features for multimodal models, allowing intermediate visual features from different layers of the vision encoder to be passed to the language model, enhancing multimodal understanding.
- Advanced Image Processing: Enhanced image processing utilities with more flexible `smart_resize` and `resize_image` functions, explicitly handling `factor`, `min_pixels`, and `max_pixels` for better control over image resizing and token length calculation.
- Rotary Positional Embedding (RoPE) Enhancements: Updated RoPE handling in the Llama model to support `mrope_section` and a new 'default' scaling type, along with an `apply_interleaved_mrope` function for 3D rotary embeddings in Qwen3-VL.
- MoE Weight Fusion Improvements: Modified the `FusedMoeWeightTP` class to correctly handle fused gate-up weights (3D tensors) for MoE experts, including tensor parallelism splitting and transposing, while also adding a check for quantization compatibility.
- Whisper Model Optimization: Optimized the Whisper model's attention mechanism by introducing `WhisperSdpaAttention`, which leverages `torch.nn.functional.scaled_dot_product_attention` for improved efficiency (a minimal sketch follows this list).
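Since the last highlight hinges on PyTorch's fused attention kernel, here is a minimal sketch of what an SDPA-based Whisper encoder attention can look like. The projection layout (`q_proj`, `k_proj`, `v_proj`, `out_proj`) follows the common Whisper implementation and is an assumption for illustration, not the exact class added in this PR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WhisperSdpaAttention(nn.Module):
    """Encoder self-attention using the fused SDPA kernel (hypothetical sketch)."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Projection layout assumed from the standard Whisper implementation.
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, _ = hidden_states.shape

        def shape(x: torch.Tensor) -> torch.Tensor:
            # (bsz, seq, embed) -> (bsz, heads, seq, head_dim)
            return x.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q = shape(self.q_proj(hidden_states))
        k = shape(self.k_proj(hidden_states))
        v = shape(self.v_proj(hidden_states))

        # One fused kernel replaces the explicit softmax(QK^T / sqrt(d)) @ V chain.
        attn = F.scaled_dot_product_attention(q, k, v)

        attn = attn.transpose(1, 2).reshape(bsz, seq_len, -1)
        return self.out_proj(attn)
```

The practical gain is that `scaled_dot_product_attention` can dispatch to flash or memory-efficient backends instead of materializing the full attention matrix in eager mode.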
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with :thumbsup: and :thumbsdown: on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Summary of Changes
This pull request integrates the Qwen3 Vision-Language (VL) model into the LightLLM framework, enabling efficient processing of multimodal inputs. It involves adapting the model's architecture, including its Mixture-of-Experts (MoE) components and specialized image processing pipeline, to work seamlessly within the existing inference system. The changes enhance the framework's capability to handle complex visual and textual data, ensuring accurate and performant multimodal inference for Qwen3 VL.
Highlights
- Qwen3 VL Model Integration: Introduced comprehensive support for the Qwen3 Vision-Language (VL) model, including both standard and Mixture-of-Experts (MoE) variants, within the LightLLM framework.
- Multimodal Image Processing Enhancements: Updated image resizing logic and introduced new vision transformer components (e.g., `Qwen3VLPatchEmbed`, `Qwen3VLVisionBlock`) to handle Qwen3 VL's specific image processing requirements, including deepstack features (a hedged `smart_resize` sketch follows this list).
- Advanced Rotary Embeddings (MRoPE): Implemented interleaved MRoPE (Multimodal Rotary Position Embedding) for Qwen3 VL, allowing for more complex positional encoding in multimodal contexts.
- Inference State and Weight Management: Added a dedicated inference state (`Qwen3VLMOEInferStateInfo`) and weight loading classes for Qwen3 VL, optimizing weight fusion for MoE layers and ensuring proper tensor parallelism handling.
- Shared Memory for Deepstack Features: Extended shared memory utilities to support the efficient transfer and storage of 'deepstack features' generated by the vision model, which are then integrated into the language model's embeddings.
- Whisper Attention Optimization: Migrated the Whisper encoder layer to use `WhisperSdpaAttention`, leveraging `torch.nn.functional.scaled_dot_product_attention` for potential performance improvements.
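For context on the resizing highlight above, the `smart_resize` logic in the Qwen-VL family typically snaps height and width to multiples of the patch `factor` and then rescales to stay within `[min_pixels, max_pixels]` while roughly preserving the aspect ratio. The sketch below follows the publicly documented Qwen2-VL preprocessing; the default values are illustrative assumptions, not necessarily what this PR uses.

```python
import math


def smart_resize(height: int, width: int, factor: int = 28,
                 min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280):
    """Return (h, w) rounded to multiples of `factor` and clamped into the pixel budget.

    Hedged sketch of Qwen-VL style resizing; defaults are illustrative only.
    """
    # Snap both sides to the nearest multiple of the patch factor.
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)

    if h_bar * w_bar > max_pixels:
        # Too many visual tokens: shrink while keeping the aspect ratio.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too few: enlarge to reach the minimum pixel budget.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor

    return h_bar, w_bar


# The visual token count after resizing is (h // factor) * (w // factor),
# which is how the resize ties into the token length calculation.
```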