[Bug] InternVL 3.5 38B OOM
Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I start InternVL3.5-38B with the following command:
#!/bin/bash
export OMP_NUM_THREADS=8
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
vllm serve /mnt/tenant-home_speed/VLM/InternVL3_5-38B \
    --tensor-parallel-size 2 \
    --quantization fp8 \
    --gpu-memory-utilization 0.95 \
    --served-model-name Qwen25VL32B \
    --max-model-len 40960 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8005 \
    --enforce-eager \
    --disable-log-requests
Running on 2x L40S (48 GB): when I send a video, the request tops out at roughly total_token = 6000, and anything beyond that immediately OOMs.
With models such as Qwen-VL I can hold conversations of around 40,000 tokens without any error.
Reproduction
#!/bin/bash
export OMP_NUM_THREADS=8
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
vllm serve /mnt/tenant-home_speed/VLM/InternVL3_5-38B \
    --tensor-parallel-size 2 \
    --quantization fp8 \
    --gpu-memory-utilization 0.95 \
    --served-model-name Qwen25VL32B \
    --max-model-len 40960 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8005 \
    --enforce-eager \
    --disable-log-requests
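For reference, a minimal client request of the kind that triggers the OOM. This assumes the OpenAI-compatible video_url content type and uses a hypothetical sample video URL; any long enough video hits the limit at roughly total_token = 6000:

from openai import OpenAI

# Hypothetical sample video; the server is the one launched by the script above.
client = OpenAI(base_url="http://localhost:8005/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen25VL32B",  # matches --served-model-name above
    messages=[{
        "role": "user",
        "content": [
            {"type": "video_url",
             "video_url": {"url": "https://example.com/sample.mp4"}},
            {"type": "text", "text": "Describe this video in detail."},
        ],
    }],
)
print(response.choices[0].message.content)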
Environment
vllm 0.10.2
torch 2.8.0
torchaudio 2.8.0
torchvision 0.23.0
flash_attn 2.8.3
flashinfer-python 0.3.1
Error traceback
[ERROR] curr_group_outputs = self.model.get_multimodal_embeddings(
[ERROR] File "/root/envs/vllm/model_executor/models/internvl.py", line 1331, in get_multimodal_embeddings
[ERROR] video_embeddings = self._process_image_input(video_input)
[ERROR] File "/root/envs/vllm/model_executor/models/internvl.py", line 1264, in _process_image_input
[ERROR] image_embeds = self.extract_feature(image_input["pixel_values_flat"])
[ERROR] File "/root/envs/vllm/model_executor/models/internvl.py", line 1154, in extract_feature
[ERROR] vit_embeds = self.vision_model(pixel_values=pixel_values)
[ERROR] File "/root/envs/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[ERROR] return self._call_impl(*args, **kwargs)
[ERROR] File "/root/envs/torch/nn/modules/module.py", line 1784, in _call_impl
[ERROR] return forward_call(*args, **kwargs)
[ERROR] File "/root/envs/vllm/model_executor/models/intern_vit.py", line 470, in forward
[ERROR] encoder_outputs = self.encoder(inputs_embeds=hidden_states)
[ERROR] File "/root/envs/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[ERROR] return self._call_impl(*args, **kwargs)
[ERROR] File "/root/envs/torch/nn/modules/module.py", line 1784, in _call_impl
[ERROR] return forward_call(*args, **kwargs)
[ERROR] File "/root/envs/vllm/model_executor/models/intern_vit.py", line 416, in forward
[ERROR] hidden_states = encoder_layer(hidden_states)
[ERROR] File "/root/envs/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[ERROR] return self._call_impl(*args, **kwargs)
[ERROR] File "/root/envs/torch/nn/modules/module.py", line 1784, in _call_impl
[ERROR] return forward_call(*args, **kwargs)
[ERROR] File "/root/envs/vllm/model_executor/models/intern_vit.py", line 375, in forward
[ERROR] hidden_states = hidden_states + self.attn(
[ERROR] File "/root/envs/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[ERROR] return self._call_impl(*args, **kwargs)
[ERROR] File "/root/envs/torch/nn/modules/module.py", line 1784, in _call_impl
[ERROR] return forward_call(*args, **kwargs)
[ERROR] File "/root/envs/vllm/model_executor/models/intern_vit.py", line 273, in forward
[ERROR] q = self.q_norm(q.flatten(-2, -1)).view(B_, N_, H_, D_)
[ERROR] File "/root/envs/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[ERROR] return self._call_impl(*args, **kwargs)
[ERROR] File "/root/envs/torch/nn/modules/module.py", line 1784, in _call_impl
[ERROR] return forward_call(*args, **kwargs)
[ERROR] File "/root/envs/vllm/model_executor/custom_op.py", line 44, in forward
[ERROR] return self._forward_method(*args, **kwargs)
[ERROR] File "/root/envs/vllm/model_executor/layers/layernorm.py", line 224, in forward_cuda
[ERROR] return rms_norm(x, self.weight.data, self.variance_epsilon)
[ERROR] File "/root/envs/vllm/model_executor/layers/layernorm.py", line 24, in rms_norm
[ERROR] ops.rms_norm(
[ERROR] File "/root/envs/vllm/_custom_ops.py", line 274, in rms_norm
[ERROR] input_contiguous = input.contiguous()
[ERROR] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 264.00 MiB. GPU 1 has a total capacity of 44.52 GiB of which 260.25 MiB is free. Process 593161 has 44.26 GiB memory in use. Of the allocated memory 42.89 GiB is allocated by PyTorch, and 386.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(EngineCore_DP0 pid=1667686) ERROR 09-15 04:00:40 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-81f63c5d7d0c43838d6122556fc01fe7,prompt_token_ids_len=11122,mm_kwargs=[{'video_num_patches': MultiModalFieldElem(modality='video', key='video_num_patches', data=tensor(42), field=MultiModalBatchedField()), 'video_token_id': MultiModalFieldElem(modality='video', key='video_token_id', data=tensor(151656), field=MultiModalSharedField(batch_size=1)), 'pixel_values_flat_video': MultiModalFieldElem(modality='video', key='pixel_values_flat_video', data=tensor([[[[-2.1250, -2.1250, -2.1250, ..., -0.6641, -0.6641, -0.6445],
This looks like it is caused by the ViT-6B being too large, so batched inference blows up GPU memory. That part of the code could be changed to something like the following:
bsz = 4
last_hidden_state = []
for splitted_hidden_states in hidden_states.split(bsz, dim=0):
    encoder_outputs = self.encoder(
        inputs_embeds=splitted_hidden_states,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    last_hidden_state.append(encoder_outputs.last_hidden_state)
last_hidden_state = torch.cat(last_hidden_state)
pooled_output = last_hidden_state[:, 0, :]
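The same pattern in a self-contained form, with a toy one-layer encoder standing in for the real InternVisionEncoder (the helper name, chunk size, and tensor shapes are illustrative, not vLLM's actual API):

import torch
import torch.nn as nn

def chunked_encoder_forward(encoder: nn.Module,
                            hidden_states: torch.Tensor,
                            chunk_size: int = 4) -> torch.Tensor:
    # Run the encoder on small slices of the tile dimension and concatenate,
    # trading a bit of speed for a much smaller peak activation footprint.
    outputs = [encoder(chunk) for chunk in hidden_states.split(chunk_size, dim=0)]
    return torch.cat(outputs, dim=0)

if __name__ == "__main__":
    # Toy stand-in for the ViT encoder.
    encoder = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    tiles = torch.randn(42, 1025, 64)  # 42 video tiles as in the scheduler dump above; dims shrunk
    out = chunked_encoder_forward(encoder, tiles, chunk_size=4)
    print(out.shape)  # torch.Size([42, 1025, 64])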
I changed it to the following, but it still runs out of memory... there does not seem to be much improvement:

bsz = hidden_states.shape[0]
encoder_outputs = []
for splitted_hidden_states in hidden_states.split(bsz, dim=0):
    encoder_outputs_last = self.encoder(inputs_embeds=splitted_hidden_states)
    encoder_outputs.append(encoder_outputs_last)
encoder_outputs = torch.cat(encoder_outputs)
return encoder_outputs
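For what it's worth, with bsz = hidden_states.shape[0] the split() call returns the whole batch as a single chunk, so the encoder still sees every tile at once and peak memory is essentially unchanged; only a small fixed chunk size (e.g. 4, as in the suggested snippet) actually shrinks the per-forward batch. A quick check with illustrative shapes:

import torch

hidden_states = torch.randn(42, 1025, 8)  # 42 video tiles; hidden dim shrunk for illustration

# bsz equal to the full batch size -> a single chunk, same as the unmodified forward
chunks = list(hidden_states.split(hidden_states.shape[0], dim=0))
print(len(chunks), chunks[0].shape)  # 1 torch.Size([42, 1025, 8])

# small fixed chunk size -> eleven small forwards instead of one big one
chunks = list(hidden_states.split(4, dim=0))
print(len(chunks), chunks[0].shape)  # 11 torch.Size([4, 1025, 8])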