sglang
sglang copied to clipboard
model(vlm): mistral 3.1
Motivation
Support Mistral Small 3.1 VLM (#4518).
Modifications
This is an extension to #5084 (#2351) by reusing the same LlavaForConditionalGeneration backbone and Pixtral vision encoder.
- [x] text generation
- [x]
imagemodalities - [x]
multi-imagesmodalities - [x] tool calling
- [x] structured output
- [x] update mistral chat template
Checklist
- [x] Format your code according to the Code Formatting with Pre-Commit.
- [x] Add unit tests as outlined in the Running Unit Tests.
- [x] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
- [x] Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
- [x] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
- [x] Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.
mmmu val: 0.522 cuda device: 2xA100 max throughput: about 60 tokens/s (heavy image prefill)
Benchmark time: 182.7642366886139
answers saved to: ./val_sglang.json
Evaluating...
{'Accounting': {'acc': 0.4, 'num': 30},
'Agriculture': {'acc': 0.562, 'num': 16},
'Architecture_and_Engineering': {'acc': 0.333, 'num': 30},
'Art': {'acc': 0.667, 'num': 30},
'Art_Theory': {'acc': 0.667, 'num': 30},
'Basic_Medical_Science': {'acc': 0.7, 'num': 30},
'Biology': {'acc': 0.533, 'num': 30},
'Chemistry': {'acc': 0.367, 'num': 30},
'Clinical_Medicine': {'acc': 0.767, 'num': 30},
'Computer_Science': {'acc': 0.5, 'num': 30},
'Design': {'acc': 0.8, 'num': 30},
'Diagnostics_and_Laboratory_Medicine': {'acc': 0.467, 'num': 30},
'Economics': {'acc': 0.5, 'num': 30},
'Electronics': {'acc': 0.4, 'num': 30},
'Energy_and_Power': {'acc': 0.333, 'num': 30},
'Finance': {'acc': 0.3, 'num': 30},
'Geography': {'acc': 0.6, 'num': 30},
'History': {'acc': 0.7, 'num': 30},
'Literature': {'acc': 0.862, 'num': 29},
'Manage': {'acc': 0.467, 'num': 30},
'Marketing': {'acc': 0.567, 'num': 30},
'Materials': {'acc': 0.4, 'num': 30},
'Math': {'acc': 0.233, 'num': 30},
'Mechanical_Engineering': {'acc': 0.3, 'num': 30},
'Music': {'acc': 0.2, 'num': 30},
'Overall': {'acc': 0.522, 'num': 885},
'Overall-Art and Design': {'acc': 0.583, 'num': 120},
'Overall-Business': {'acc': 0.447, 'num': 150},
'Overall-Health and Medicine': {'acc': 0.653, 'num': 150},
'Overall-Humanities and Social Science': {'acc': 0.723, 'num': 119},
'Overall-Science': {'acc': 0.427, 'num': 150},
'Overall-Tech and Engineering': {'acc': 0.393, 'num': 196},
'Pharmacy': {'acc': 0.567, 'num': 30},
'Physics': {'acc': 0.4, 'num': 30},
'Psychology': {'acc': 0.6, 'num': 30},
'Public_Health': {'acc': 0.767, 'num': 30},
'Sociology': {'acc': 0.733, 'num': 30}}
eval out saved to ./val_sglang.json
Overall accuracy: 0.522
@KivenChen rebase with main?
@GeLee-Q and @minleminzui is on this PR.
@yhyang201 yuhao is on this! thanks!
@kevin85421 hey kevin, thanks so much for your help. Do you use wechat? You can add me through my wechat ID learnAIcantSaveChina
@zhaochenyang20 maybe you want to ping @KivenChen? 😆
Ran into an error when running it like so:
python3 -m sglang.launch_server --model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic --chat-template=mistral --speculative-algorithm EAGLE --speculative-draft-model-path kavin1337/Mistral-Small-3.1-DRAFT-0.5B-FP8-Dynamic --speculative-num-steps 3 --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --host 0.0.0.0 --port 30000
Error log:
2025-04-25T17:50:36.228441883Z ==================================
2025-04-25T17:50:36.228448813Z == Triton Inference Server Base ==
2025-04-25T17:50:36.228453153Z ==================================
2025-04-25T17:50:36.231612353Z NVIDIA Release 24.04 (build 90085237)
2025-04-25T17:50:36.232640859Z Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2025-04-25T17:50:36.233754407Z Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2025-04-25T17:50:36.233763677Z This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2025-04-25T17:50:36.233768477Z By pulling and using the container, you accept the terms and conditions of this license:
2025-04-25T17:50:36.233772797Z https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
2025-04-25T17:50:40.447537084Z [2025-04-25 17:50:40] server_args=ServerArgs(model_path='RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic', tokenizer_path='RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic', tokenizer_mode='auto', skip_tokenizer_init=False, enable_tokenizer_batch_encode=False, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic', chat_template='llama-2', completion_template=None, is_embedding=False, revision=None, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=48, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=1, stream_interval=1, stream_output=False, random_seed=598240062, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', speculative_algorithm='EAGLE', speculative_draft_model_path='kavin1337/Mistral-Small-3.1-DRAFT-0.5B-FP8-Dynamic', speculative_num_steps=3, speculative_eagle_topk=4, speculative_num_draft_tokens=16, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_nccl_nvls=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_multimodal=None, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=None, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, warmups=None, moe_dense_tp_size=None, n_share_experts_fusion=0, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_bootstrap_port=8998, disaggregation_transfer_backend='mooncake', disaggregation_ib_device=None)
2025-04-25T17:50:40.655004754Z [2025-04-25 17:50:40] Downcasting torch.float32 to torch.float16.
2025-04-25T17:50:40.966376352Z [2025-04-25 17:50:40] Ignore import error when loading sglang.srt.managers.multimodal_processors.pixtral: cannot import name 'MultiModalityDataPaddingPatternImageTokens' from 'sglang.srt.managers.mm_utils' (/sgl-workspace/python/sglang/srt/managers/mm_utils.py)
2025-04-25T17:50:44.249508190Z [2025-04-25 17:50:44 TP0] Downcasting torch.float32 to torch.float16.
2025-04-25T17:50:45.321056933Z Traceback (most recent call last):
2025-04-25T17:50:45.321085584Z File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
2025-04-25T17:50:45.321088504Z return _run_code(code, main_globals, None,
2025-04-25T17:50:45.321090594Z File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2025-04-25T17:50:45.321093174Z exec(code, run_globals)
2025-04-25T17:50:45.321095504Z File "/sgl-workspace/python/sglang/launch_server.py", line 14, in <module>
2025-04-25T17:50:45.321101354Z launch_server(server_args)
2025-04-25T17:50:45.321103284Z File "/sgl-workspace/python/sglang/srt/entrypoints/http_server.py", line 700, in launch_server
2025-04-25T17:50:45.321106224Z tokenizer_manager, scheduler_info = _launch_subprocesses(server_args=server_args)
2025-04-25T17:50:45.321108144Z File "/sgl-workspace/python/sglang/srt/entrypoints/engine.py", line 573, in _launch_subprocesses
2025-04-25T17:50:45.321110624Z tokenizer_manager = TokenizerManager(server_args, port_args)
2025-04-25T17:50:45.321112594Z File "/sgl-workspace/python/sglang/srt/managers/tokenizer_manager.py", line 194, in __init__
2025-04-25T17:50:45.321114414Z self.mm_processor = get_mm_processor(
2025-04-25T17:50:45.321116274Z File "/sgl-workspace/python/sglang/srt/managers/multimodal_processor.py", line 60, in get_mm_processor
2025-04-25T17:50:45.321118165Z return processor_cls(hf_config, server_args, processor)
2025-04-25T17:50:45.321120114Z File "/sgl-workspace/python/sglang/srt/managers/multimodal_processors/llava.py", line 208, in __init__
2025-04-25T17:50:45.321121934Z self.inner = self._get_sgl_processor_cls(vision_type)(
2025-04-25T17:50:45.321123914Z File "/sgl-workspace/python/sglang/srt/managers/multimodal_processors/llava.py", line 196, in _get_sgl_processor_cls
2025-04-25T17:50:45.321126465Z raise ValueError(
2025-04-25T17:50:45.321129075Z ValueError: Cannot find corresponding multimodal processor registered in sglang for model type `pixtral`
2025-04-25T17:50:45.326266286Z /usr/lib/python3.10/multiprocessing/resource_tracker.py:104: UserWarning: resource_tracker: process died unexpectedly, relaunching. Some resources might leak.
2025-04-25T17:50:45.326282346Z warnings.warn('resource_tracker: process died unexpectedly, '
2025-04-25T17:50:45.358969405Z Traceback (most recent call last):
2025-04-25T17:50:45.358992965Z File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
2025-04-25T17:50:45.358996305Z cache[rtype].remove(name)
2025-04-25T17:50:45.358998815Z KeyError: '/mp-5vpcz0od'
Is there something else I should be doing instead?
Edit: this is caused by c998d04b46920f06d945fbef9023884a768723fc, as it the class was renamed and modified for all modalities?
Hi @FireMasterK, you are correct about the cause. It is now up to date.
Priority for this PR please! @ch-wan @minleminzui @zhaochenyang20
@KivenChen fix lint?
readily fixed @zhaochenyang20
We should later delete docs/supported_models/vision_language_models.md and move mistral part to docs/supported_models/multimodal_language_models.md, but it's OK to merge now.