[Bug] vLLM deployment of InternVL3_5-241B-A28B fails

Open liuxuexun opened this issue 3 months ago • 14 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654] Traceback (most recent call last):
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 649, in worker_busy_loop
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     output = func(*args, **kwargs)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     return func(*args, **kwargs)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 263, in determine_available_memory
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     self.model_runner.profile_run()
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3017, in profile_run
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     self.model.get_multimodal_embeddings(
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/internvl.py", line 1331, in get_multimodal_embeddings
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     video_embeddings = self._process_image_input(video_input)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/internvl.py", line 1264, in _process_image_input
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     image_embeds = self.extract_feature(image_input["pixel_values_flat"])
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/internvl.py", line 1154, in extract_feature
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     vit_embeds = self.vision_model(pixel_values=pixel_values)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/intern_vit.py", line 467, in forward
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     encoder_outputs = self.encoder(inputs_embeds=hidden_states)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/intern_vit.py", line 413, in forward
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     hidden_states = encoder_layer(hidden_states)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/intern_vit.py", line 372, in forward
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     hidden_states = hidden_states + self.attn(
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/intern_vit.py", line 277, in forward
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     x = self.attn(q, k, v)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     return self._call_impl(*args, **kwargs)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     return forward_call(*args, **kwargs)
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]   File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/attention/layer.py", line 383, in forward
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654]     bsz, q_len, _ = query.size()
(Worker_TP0 pid=173356) ERROR 09-15 15:24:40 [multiproc_executor.py:654] ValueError: too many values to unpack (expected 3)

I am using vllm==0.10.2, thank you!
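For context, the failing line in vllm/attention/layer.py:383 (`bsz, q_len, _ = query.size()`) assumes a 3-D query of shape (batch, seq_len, hidden). A minimal standalone sketch (the shapes below are assumptions for illustration only, not taken from the model) reproduces the same ValueError when the query arrives with an extra dimension:

```python
# Minimal sketch of the unpack failure; shapes are illustrative assumptions only.
import torch

query_3d = torch.randn(2, 1024, 3200)     # (batch, seq_len, hidden): unpack succeeds
bsz, q_len, _ = query_3d.size()

query_4d = torch.randn(2, 1024, 25, 128)  # e.g. (batch, seq_len, num_heads, head_dim)
try:
    bsz, q_len, _ = query_4d.size()       # same failure as attention/layer.py:383
except ValueError as err:
    print(err)                            # too many values to unpack (expected 3)
```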

Reproduction

vllm serve /root/models/InternVL3_5-241B-A28B/ --port 8084 --host 0.0.0.0 --dtype bfloat16 --max-model-len 16384 --tensor-parallel-size 8 --enforce-eager --allowed-local-media-path / --trust_remote_code

Environment

python=3.10 torch=2.8.0 vllm=0.10.2

Error traceback


liuxuexun · Sep 15 '25 07:09

same err here

Journey7331 · Sep 16 '25 09:09

same err here

The maintainers haven't replied; I guess this model isn't supported yet. I tried InternVL3_5-1B and it can be deployed.

liuxuexun · Sep 16 '25 09:09

same err here

The maintainers haven't replied; I guess this model isn't supported yet. I tried InternVL3_5-1B and it can be deployed.

It seems none of the MoE ones work well. Also, doesn't the 38B one (the one with the 5B ViT) have a small issue too? It won't load.

Journey7331 · Sep 16 '25 09:09

same err here

The maintainers haven't replied; I guess this model isn't supported yet. I tried InternVL3_5-1B and it can be deployed.

It seems none of the MoE ones work well. Also, doesn't the 38B one (the one with the 5B ViT) have a small issue too? It won't load.

InternVL3_5-30B-A3B is also an MoE model, and it works with vLLM 0.10.2.

Pig255 · Sep 18 '25 08:09

same err here

The maintainers haven't replied; I guess this model isn't supported yet. I tried InternVL3_5-1B and it can be deployed.

It seems none of the MoE ones work well. Also, doesn't the 38B one (the one with the 5B ViT) have a small issue too? It won't load.

The 38B runs fine on my side.

liuxuexun · Sep 18 '25 09:09

Sadly 😢 I got this error with InternVL3.5-30B-A3B, using the latest precompiled source build, vllm 0.1.dev2030+gf4cd80f94.d20250918.precompiled

...
Loading safetensors checkpoint shards:  77% Completed | 10/13 [03:30<00:42, 14.32s/it]
Loading safetensors checkpoint shards:  85% Completed | 11/13 [03:31<00:20, 10.25s/it]
Loading safetensors checkpoint shards:  92% Completed | 12/13 [03:32<00:07,  7.46s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [03:33<00:00,  5.52s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [03:33<00:00, 16.42s/it]
[default_loader.py:268] Loading weights took 213.61 seconds
[gpu_model_runner.py:2542] Model loading took 57.4721 GiB and 213.851256 seconds
[gpu_model_runner.py:3214] Encoder cache will be initialized with a budget of 16094 tokens, and profiled with 1 video items of the maximum feature size.
[core.py:712] EngineCore failed to start.
[core.py:712] Traceback (most recent call last):
[core.py:712]   File "/work/proj/vllm/vllm/v1/engine/core.py", line 703, in run_engine_core
[core.py:712]     engine_core = EngineCoreProc(*args, **kwargs)
[core.py:712]   File "/work/proj/vllm/vllm/v1/engine/core.py", line 502, in __init__
[core.py:712]     super().__init__(vllm_config, executor_class, log_stats,
[core.py:712]   File "/work/proj/vllm/vllm/v1/engine/core.py", line 90, in __init__
[core.py:712]     self._initialize_kv_caches(vllm_config)
[core.py:712]   File "/work/proj/vllm/vllm/v1/engine/core.py", line 188, in _initialize_kv_caches
[core.py:712]     self.model_executor.determine_available_memory())
[core.py:712]   File "/work/proj/vllm/vllm/v1/executor/abstract.py", line 85, in determine_available_memory
[core.py:712]     return self.collective_rpc("determine_available_memory")
[core.py:712]   File "/work/proj/vllm/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
[core.py:712]     return [run_method(self.driver_worker, method, args, kwargs)]
[core.py:712]   File "/work/proj/vllm/vllm/utils/__init__.py", line 3067, in run_method
[core.py:712]     return func(*args, **kwargs)
[core.py:712]   File "/work/proj/vllm/.venv_local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
[core.py:712]     return func(*args, **kwargs)
[core.py:712]   File "/work/proj/vllm/vllm/v1/worker/gpu_worker.py", line 264, in determine_available_memory
[core.py:712]     self.model_runner.profile_run()
[core.py:712]   File "/work/proj/vllm/vllm/v1/worker/gpu_model_runner.py", line 3231, in profile_run
[core.py:712]     self.model.get_multimodal_embeddings(
[core.py:712]   File "/work/proj/vllm/vllm/model_executor/models/interns1.py", line 736, in get_multimodal_embeddings
[core.py:712]     modalities = self._parse_and_validate_multimodal_inputs(**kwargs)
[core.py:712]   File "/work/proj/vllm/vllm/model_executor/models/interns1.py", line 722, in _parse_and_validate_multimodal_inputs
[core.py:712]     modalities["videos"] = self._parse_and_validate_video_input(
[core.py:712]   File "/work/proj/vllm/vllm/model_executor/models/interns1.py", line 670, in _parse_and_validate_video_input
[core.py:712]     return InternS1VideoPixelInputs(
[core.py:712]   File "/work/proj/vllm/vllm/utils/tensor_schema.py", line 67, in __init__
[core.py:712]     self.validate()
[core.py:712]   File "/work/proj/vllm/vllm/utils/tensor_schema.py", line 222, in validate
[core.py:712]     self._validate_tensor_shape_expected(
[core.py:712]   File "/work/proj/vllm/vllm/utils/tensor_schema.py", line 146, in _validate_tensor_shape_expected
[core.py:712]     raise ValueError(f"{field_name} dim[{i}] expected "
[core.py:712] ValueError: pixel_values dim[2] expected 448, got 384
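For reference, the check raising here is the per-dimension tensor-schema validation of the video input. A rough standalone sketch of what it verifies (simplified and with assumed shapes; this is not vLLM's actual TensorSchema code):

```python
# Rough sketch of the per-dimension shape check (simplified; not vLLM's TensorSchema).
import torch

EXPECTED_FRAME_SHAPE = (3, 448, 448)  # (channels, height, width) expected per frame

def validate_pixel_values(pixel_values: torch.Tensor) -> None:
    # pixel_values: (num_frames_or_patches, channels, height, width)
    for i, expected in enumerate(EXPECTED_FRAME_SHAPE, start=1):
        got = pixel_values.shape[i]
        if got != expected:
            raise ValueError(f"pixel_values dim[{i}] expected {expected}, got {got}")

validate_pixel_values(torch.zeros(32, 3, 448, 448))  # passes
validate_pixel_values(torch.zeros(32, 3, 384, 384))  # raises: dim[2] expected 448, got 384
```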

Journey7331 · Sep 18 '25 10:09

same err here

The error can be resolved by installing the specific version vllm==0.10.1.1.

WesKwong · Sep 19 '25 11:09

same err here

The error can be resolved by installing the specific version vllm==0.10.1.1.

Sorry for breaking this. https://github.com/vllm-project/vllm/pull/25146 should fix this.

Isotr0py · Sep 19 '25 12:09

@Journey7331 I tried but was unable to reproduce the tensor schema issue on vLLM's current main branch. The processed pixel_values from OpenGVLab/InternVL3_5-30B-A3B-Instruct indeed have a 448x448 size per patch on my side.

INFO 09-19 20:26:09 [__init__.py:216] Automatically detected platform cuda.
INFO 09-19 20:26:12 [utils.py:328] non-default args: {'trust_remote_code': True, 'load_format': 'dummy', 'max_model_len': 8192, 'hf_overrides': <function run_internvl.<locals>.dummy_hf_overrides at 0x784f38269b20>, 'limit_mm_per_prompt': {'image': 0, 'video': 1, 'audio': 0}, 'model': 'OpenGVLab/InternVL3_5-30B-A3B-Instruct'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
INFO 09-19 20:26:22 [__init__.py:712] Resolved architecture: InternVLChatModel
`torch_dtype` is deprecated! Use `dtype` instead!
INFO 09-19 20:26:22 [__init__.py:1774] Using max model len 8192
INFO 09-19 20:26:23 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=328923) INFO 09-19 20:26:27 [core.py:648] Waiting for init message from front-end.
(EngineCore_DP0 pid=328923) INFO 09-19 20:26:27 [core.py:75] Initializing a V1 LLM engine (v0.10.2rc3.dev202+gb33ce782c.d20250918) with config: model='OpenGVLab/InternVL3_5-30B-A3B-Instruct', speculative_config=None, tokenizer='OpenGVLab/InternVL3_5-30B-A3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=dummy, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=OpenGVLab/InternVL3_5-30B-A3B-Instruct, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=328923) W0919 20:26:27.596000 328923 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
(EngineCore_DP0 pid=328923) W0919 20:26:27.596000 328923 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[W919 20:26:28.559923202 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=328923) INFO 09-19 20:26:28 [parallel_state.py:1206] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=328923) INFO 09-19 20:26:28 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=328923) WARNING 09-19 20:26:28 [__init__.py:2179] The following intended overrides are not keyword args and will be dropped: {'truncation'}
(EngineCore_DP0 pid=328923) WARNING 09-19 20:26:28 [registry.py:183] InternVLProcessor did not return `BatchFeature`. Make sure to match the behaviour of `ProcessorMixin` when implementing custom processors.
(EngineCore_DP0 pid=328923) WARNING 09-19 20:26:28 [__init__.py:2179] The following intended overrides are not keyword args and will be dropped: {'truncation'}
(EngineCore_DP0 pid=328923) INFO 09-19 20:26:28 [gpu_model_runner.py:2519] Starting to load model OpenGVLab/InternVL3_5-30B-A3B-Instruct...
(EngineCore_DP0 pid=328923) INFO 09-19 20:26:28 [gpu_model_runner.py:2551] Loading model from scratch...
(EngineCore_DP0 pid=328923) INFO 09-19 20:26:28 [layer.py:423] MultiHeadAttention attn_backend: _Backend.FLASH_ATTN, use_upstream_fa: False
(EngineCore_DP0 pid=328923) INFO 09-19 20:26:29 [cuda.py:371] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=328923) INFO 09-19 20:26:29 [gpu_model_runner.py:2573] Model loading took 1.2735 GiB and 0.117064 seconds
(EngineCore_DP0 pid=328923) INFO 09-19 20:26:29 [gpu_model_runner.py:3254] Encoder cache will be initialized with a budget of 8407 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore_DP0 pid=328923) WARNING 09-19 20:26:29 [__init__.py:2179] The following intended overrides are not keyword args and will be dropped: {'truncation'}
(EngineCore_DP0 pid=328923) WARNING 09-19 20:26:29 [__init__.py:2179] The following intended overrides are not keyword args and will be dropped: {'truncation'}
(EngineCore_DP0 pid=328923) pixel_values_flat_video: torch.Size([32, 3, 448, 448])

Isotr0py · Sep 19 '25 12:09

@Isotr0py I hit a dependency resolution failure when installing the latest code in editable mode. Do you have any idea how to resolve it?

...
Available versions: 0.10.2, 0.10.1.1, 0.10.1, 0.10.0, 0.9.2, 0.9.1, 0.9.0.1, 0.9.0, 0.8.5.post1, 0.8.5, 0.8.4, 0.8.3, 0.8.2, 0.8.1, 0.8.0, 0.7.3, 0.7.2, 0.7.1, 0.7.0, 0.6.6.post1, 0.6.6, 0.6.5, 0.6.4.post1, 0.6.4, 0.6.3.post1, 0.6.3, 0.6.2, 0.6.1.post2, 0.6.1.post1, 0.6.1, 0.6.0, 0.5.5, 0.5.4, 0.5.3.post1, 0.5.3, 0.5.2, 0.5.1, 0.5.0.post1, 0.5.0, 0.4.3, 0.4.2, 0.4.1, 0.4.0.post1, 0.4.0, 0.3.3, 0.3.2, 0.3.1, 0.3.0, 0.2.7, 0.2.6, 0.2.5, 0.2.4, 0.2.3, 0.2.2, 0.2.1.post1, 0.2.0, 0.1.7, 0.1.6, 0.1.5, 0.1.4, 0.1.3, 0.1.2, 0.1.1, 0.1.0, 0.0.1
  INSTALLED: 0.1.dev2070+g9d1c50a5a.d20250919.precompiled
  LATEST:    0.10.2

 * Upgrade vllm with precompiled wheels (editable mode)
 * VLLM_USE_PRECOMPILED=1 uv pip install -e . --torch-backend=cu118 -i https://pypi.tuna.tsinghua.edu.cn/simple
Proceed with upgrade? (y(↵)/n): 

Using Python 3.10.12 environment at: .venv_local
  × Failed to build `vllm @ file:///work/proj/vllm`
  ├─▶ Failed to resolve requirements from `build-system.requires`
  ├─▶ No solution found when resolving: `cmake>=3.26.1`, `ninja`, `packaging>=24.2`, `setuptools>=77.0.3, <80.0.0`, `setuptools-scm>=8.0`, `torch==2.8.0`, `wheel`, `jinja2`
  ╰─▶ Because there is no version of torch==2.8.0 and you require torch==2.8.0, we can conclude that your requirements are unsatisfiable.

# pip index  versions torch
Available versions: 2.8.0, 2.7.1, 2.7.0, 2.6.0, 2.5.1, 2.5.0, 2.4.1, 2.4.0, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0, 2.1.2, 2.1.1, 2.1.0, 2.0.1, 2.0.0, 1.13.1, 1.13.0, 1.12.1, 1.12.0, 1.11.0

#  nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

Journey7331 · Sep 19 '25 13:09

Hmmm, I remember that pytorch 2.8 has deprecated cu118 support...

Isotr0py · Sep 19 '25 13:09

Hmmm, I remember that pytorch 2.8 has deprecated cu118 support...

Ohhh, thanks, it works with cu126.


BTW, the pixel_values error only occurs with the checkpoint downloaded from OpenGVLab/InternVL3_5-30B-A3B-HF (the HF-format version)

...
[core.py:712] ValueError: pixel_values dim[2] expected 448, got 384

Journey7331 · Sep 22 '25 06:09

BTH, pixel_values error only exists with ckpt downloaded from OpenGVLab/InternVL3_5-30B-A3B-HF HF version

Oh, I see. HF-format models use a different model implementation than the GitHub-format ones; I will take a look later today.
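One way to see where the 384 might be coming from is to compare what each checkpoint's image processor requests at preprocessing time. This is only a sketch: the attribute names differ across processor classes, and either repo may or may not ship a loadable preprocessor config, hence the guards.

```python
# Sketch: compare preprocessing sizes of the two checkpoint formats (assumed to be
# loadable via AutoImageProcessor; attribute names are guarded because they vary).
from transformers import AutoImageProcessor

repos = (
    "OpenGVLab/InternVL3_5-30B-A3B-Instruct",  # GitHub-format checkpoint
    "OpenGVLab/InternVL3_5-30B-A3B-HF",        # HF-format checkpoint
)
for repo in repos:
    try:
        proc = AutoImageProcessor.from_pretrained(repo, trust_remote_code=True)
    except Exception as exc:  # the repo may not ship a loadable preprocessor config
        print(repo, "->", exc)
        continue
    print(repo, type(proc).__name__,
          "size:", getattr(proc, "size", None),
          "crop_size:", getattr(proc, "crop_size", None))
```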

Isotr0py · Sep 22 '25 07:09

Also seeing this error when deploying on H100s using vLLM v0.10.2.

--port 8002 --model /config/models/model --tensor-parallel-size 8 --disable-log-requests --enable-chunked-prefill --enable-prefix-caching --max-model-len 32768 --served-model-name intern-vl-241b-a28b --trust-remote-code --enable-expert-parallel

mphilippnv · Oct 14 '25 16:10

same err here

The maintainers haven't replied; I guess this model isn't supported yet. I tried InternVL3_5-1B and it can be deployed.

It seems none of the MoE ones work well. Also, doesn't the 38B one (the one with the 5B ViT) have a small issue too? It won't load.

The 38B runs fine on my side.

@liuxuexun Are you using vLLM? Loading the model weights fails on my side, with vllm==0.11.2

ValueError: Following weights were not initialized from checkpoint: {'language_model.model.layers.2.mlp.down_proj.weight', 

hongjx175 · Nov 25 '25 10:11