[BUG] vLLM offline inference error
是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
- [x] 我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions
该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
- [x] 我已经搜索过FAQ | I have searched FAQ
当前行为 | Current Behavior
(EngineCore_0 pid=1205373) INFO 08-25 20:15:23 [gpu_worker.py:276] Available KV cache memory: 7.39 GiB
(EngineCore_0 pid=1205373) INFO 08-25 20:15:23 [kv_cache_utils.py:849] GPU KV cache size: 242,272 tokens
(EngineCore_0 pid=1205373) INFO 08-25 20:15:23 [kv_cache_utils.py:853] Maximum concurrency for 10,000 tokens per request: 24.23x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████████| 67/67 [00:01<00:00, 35.63it/s]
(EngineCore_0 pid=1205373) INFO 08-25 20:15:25 [gpu_model_runner.py:2708] Graph capturing finished in 2 secs, took 1.91 GiB
(EngineCore_0 pid=1205373) INFO 08-25 20:15:25 [core.py:214] init engine (profile, create kv cache, warmup model) took 10.71 seconds
INFO 08-25 20:15:25 [llm.py:298] Supported_tasks: ['generate']
Adding requests: 0%| | 0/1 [00:00<?, ?it/s]/home/zhangwenkang/anaconda3/envs/minicpm/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:640: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
warnings.warn(
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.62it/s]
Processed prompts: 100%|████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.35it/s, est. speed input: 2074.13 toks/s, output: 63.76 toks/s]
546 670 595 719
ERROR 08-25 20:15:26 [core_client.py:562] Engine core proc EngineCore_0 died unexpectedly, shutting down client.
vLLM 0.10.1.1, driver version 12.2, torch 2.7.1, torchaudio 2.7.1, torchvision 0.22.1
Is my driver version too low?
期望行为 | Expected Behavior
No response
复现方法 | Steps To Reproduce
No response
运行环境 | Environment
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
备注 | Anything else?
No response
Could you please provide a bit more information so we can better pinpoint the issue? For example, the vLLM startup command, the full logs, and your GPU model?
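If it helps, a snippet along these lines can gather most of that information (just a rough sketch; it assumes nvidia-smi is on PATH and that torch, transformers, and vllm are importable in the same environment):

# Rough sketch for collecting the requested environment info.
import subprocess
import torch, transformers, vllm

print("vLLM:", vllm.__version__)
print("Transformers:", transformers.__version__)
print("PyTorch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
# Driver version and GPU name as reported by the driver itself.
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv"],
    capture_output=True, text=True
).stdout)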
GPU: 4090, CUDA version: 12.2
Inference code:
from transformers import AutoTokenizer
from PIL import Image
from vllm import LLM, SamplingParams

MODEL_NAME = "/home/zhangwenkang/code/mm/LLaMA-Factory/OpenBMB/MiniCPM-V-4"

image = Image.open("/home/zhangwenkang/code/mm/LLaMA-Factory/dataset/aqy/aqy-3-test-aqy-tencent/电视剧/黑屏/0825112024.png").convert("RGB")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

llm = LLM(
    model=MODEL_NAME,
    max_model_len=10000,
    trust_remote_code=True,
    gpu_memory_utilization=0.8,
    disable_mm_preprocessor_cache=True,
    dtype="bfloat16",
    limit_mm_per_prompt={"image": 1}
)

messages = [{
    "role": "user",
    "content": "(<image>./</image>)\n..."  # prompt text after the image placeholder was cut off in the original post
}]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
        # For multi-image inference, use list format:
        # "image": [image1, image2]
    },
}

stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]

sampling_params = SamplingParams(
    stop_token_ids=stop_token_ids,
    temperature=0.7,
    top_p=0.8,
    max_tokens=4096
)

outputs = llm.generate(inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
Start command: python xxx.py
Log:
INFO 08-26 11:50:39 [__init__.py:241] Automatically detected platform cuda.
INFO 08-26 11:50:40 [utils.py:326] non-default args: {'model': '/home/zhangwenkang/code/mm/LLaMA-Factory/OpenBMB/MiniCPM-V-4', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 10000, 'gpu_memory_utilization': 0.8, 'disable_log_stats': True, 'limit_mm_per_prompt': {'image': 1}, 'disable_mm_preprocessor_cache': True}
WARNING 08-26 11:50:40 [arg_utils.py:888] --disable-mm-preprocessor-cache is deprecated and will be removed in v0.13. Please use --mm-processor-cache-gb 0 instead.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
INFO 08-26 11:50:43 [__init__.py:711] Resolved architecture: MiniCPMV
INFO 08-26 11:50:43 [__init__.py:1750] Using max model len 10000
INFO 08-26 11:50:43 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_0 pid=1389956) INFO 08-26 11:50:44 [core.py:636] Waiting for init message from front-end.
(EngineCore_0 pid=1389956) INFO 08-26 11:50:44 [core.py:74] Initializing a V1 LLM engine (v0.10.1.1) with config: model='/home/zhangwenkang/code/mm/LLaMA-Factory/OpenBMB/MiniCPM-V-4', speculative_config=None, tokenizer='/home/zhangwenkang/code/mm/LLaMA-Factory/OpenBMB/MiniCPM-V-4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/zhangwenkang/code/mm/LLaMA-Factory/OpenBMB/MiniCPM-V-4, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_0 pid=1389956) INFO 08-26 11:50:44 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=1389956) WARNING 08-26 11:50:44 [topk_topp_sampler.py:61] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_0 pid=1389956) /home/zhangwenkang/anaconda3/envs/minicpm/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:640: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
(EngineCore_0 pid=1389956) warnings.warn(
(EngineCore_0 pid=1389956) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(EngineCore_0 pid=1389956) INFO 08-26 11:50:45 [gpu_model_runner.py:1953] Starting to load model /home/zhangwenkang/code/mm/LLaMA-Factory/OpenBMB/MiniCPM-V-4...
(EngineCore_0 pid=1389956) INFO 08-26 11:50:45 [gpu_model_runner.py:1985] Loading model from scratch...
(EngineCore_0 pid=1389956) INFO 08-26 11:50:45 [cuda.py:328] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=1389956) INFO 08-26 11:50:45 [cuda.py:345] Using FlexAttention backend for head_size=72 on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 2.31it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.68it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.75it/s]
(EngineCore_0 pid=1389956)
(EngineCore_0 pid=1389956) INFO 08-26 11:50:47 [default_loader.py:262] Loading weights took 1.23 seconds
(EngineCore_0 pid=1389956) INFO 08-26 11:50:47 [gpu_model_runner.py:2007] Model loading took 7.6119 GiB and 1.382586 seconds
(EngineCore_0 pid=1389956) INFO 08-26 11:50:47 [gpu_model_runner.py:2591] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 7 video items of the maximum feature size.
(EngineCore_0 pid=1389956) INFO 08-26 11:50:52 [backends.py:548] Using cache directory: /home/zhangwenkang/.cache/vllm/torch_compile_cache/d6da71c45d/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=1389956) INFO 08-26 11:50:52 [backends.py:559] Dynamo bytecode transform time: 3.10 s
(EngineCore_0 pid=1389956) INFO 08-26 11:50:54 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 2.271 s
(EngineCore_0 pid=1389956) INFO 08-26 11:50:55 [monitor.py:34] torch.compile takes 3.10 s in total
(EngineCore_0 pid=1389956) INFO 08-26 11:50:55 [gpu_worker.py:276] Available KV cache memory: 7.39 GiB
(EngineCore_0 pid=1389956) INFO 08-26 11:50:56 [kv_cache_utils.py:849] GPU KV cache size: 242,272 tokens
(EngineCore_0 pid=1389956) INFO 08-26 11:50:56 [kv_cache_utils.py:853] Maximum concurrency for 10,000 tokens per request: 24.23x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████| 67/67 [00:01<00:00, 36.11it/s]
(EngineCore_0 pid=1389956) INFO 08-26 11:50:58 [gpu_model_runner.py:2708] Graph capturing finished in 2 secs, took 1.91 GiB
(EngineCore_0 pid=1389956) INFO 08-26 11:50:58 [core.py:214] init engine (profile, create kv cache, warmup model) took 10.73 seconds
INFO 08-26 11:50:58 [llm.py:298] Supported_tasks: ['generate']
Adding requests: 0%| | 0/1 [00:00<?, ?it/s]/home/zhangwenkang/anaconda3/envs/minicpm/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:640: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
warnings.warn(
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.60it/s]
Processed prompts: 100%|████████████████████████████████████| 1/1 [00:00<00:00, 3.36it/s, est. speed input: 2078.60 toks/s, output: 63.90 toks/s]
546 670 595 719
ERROR 08-26 11:50:59 [core_client.py:562] Engine core proc EngineCore_0 died unexpectedly, shutting down client.
accelerate 1.10.0 aiohappyeyeballs 2.6.1 aiohttp 3.12.15 aiosignal 1.4.0 annotated-types 0.7.0 anyio 4.10.0 astor 0.8.1 async-timeout 5.0.1 attrs 25.3.0 blake3 1.0.5 cachetools 6.1.0 cbor2 5.7.0 certifi 2025.8.3 cffi 1.17.1 charset-normalizer 3.4.3 click 8.2.1 cloudpickle 3.1.1 compressed-tensors 0.10.2 cupy-cuda12x 13.6.0 depyf 0.19.0 dill 0.4.0 diskcache 5.6.3 distro 1.9.0 dnspython 2.7.0 einops 0.8.1 email_validator 2.2.0 exceptiongroup 1.3.0 fastapi 0.116.1 fastapi-cli 0.0.8 fastapi-cloud-cli 0.1.5 fastrlock 0.8.3 filelock 3.19.1 frozenlist 1.7.0 fsspec 2025.7.0 gguf 0.17.1 h11 0.16.0 hf-xet 1.1.8 httpcore 1.0.9 httptools 0.6.4 httpx 0.28.1 huggingface-hub 0.34.4 idna 3.10 interegular 0.3.3 Jinja2 3.1.6 jiter 0.10.0 jsonschema 4.25.1 jsonschema-specifications 2025.4.1 lark 1.2.2 llguidance 0.7.30 llvmlite 0.44.0 lm-format-enforcer 0.10.12 markdown-it-py 4.0.0 MarkupSafe 3.0.2 mdurl 0.1.2 mistral_common 1.8.4 mpmath 1.3.0 msgpack 1.1.1 msgspec 0.19.0 multidict 6.6.4 networkx 3.4.2 ninja 1.13.0 numba 0.61.2 numpy 2.2.6 nvidia-cublas-cu12 12.6.4.1 nvidia-cuda-cupti-cu12 12.6.80 nvidia-cuda-nvrtc-cu12 12.6.77 nvidia-cuda-runtime-cu12 12.6.77 nvidia-cudnn-cu12 9.5.1.17 nvidia-cufft-cu12 11.3.0.4 nvidia-cufile-cu12 1.11.1.6 nvidia-curand-cu12 10.3.7.77 nvidia-cusolver-cu12 11.7.1.2 nvidia-cusparse-cu12 12.5.4.2 nvidia-cusparselt-cu12 0.6.3 nvidia-nccl-cu12 2.26.2 nvidia-nvjitlink-cu12 12.6.85 nvidia-nvtx-cu12 12.6.77 openai 1.101.0 openai-harmony 0.0.4 opencv-python-headless 4.12.0.88 outlines_core 0.2.10 packaging 25.0 partial-json-parser 0.2.1.1.post6 pillow 11.3.0 pip 25.1 prometheus_client 0.22.1 prometheus-fastapi-instrumentator 7.1.0 propcache 0.3.2 protobuf 6.32.0 psutil 7.0.0 py-cpuinfo 9.0.0 pybase64 1.4.2 pycountry 24.6.1 pycparser 2.22 pydantic 2.11.7 pydantic_core 2.33.2 pydantic-extra-types 2.10.5 Pygments 2.19.2 python-dotenv 1.1.1 python-json-logger 3.3.0 python-multipart 0.0.20 PyYAML 6.0.2 pyzmq 27.0.2 ray 2.48.0 referencing 0.36.2 regex 2025.7.34 requests 2.32.5 rich 14.1.0 rich-toolkit 0.15.0 rignore 0.6.4 rpds-py 0.27.0 safetensors 0.6.2 scipy 1.15.3 sentencepiece 0.2.1 sentry-sdk 2.35.0 setproctitle 1.3.6 setuptools 78.1.1 shellingham 1.5.4 sniffio 1.3.1 soundfile 0.13.1 soxr 0.5.0.post1 starlette 0.47.3 sympy 1.14.0 tiktoken 0.11.0 tokenizers 0.21.4 torch 2.7.1 torchaudio 2.7.1 torchvision 0.22.1 tqdm 4.67.1 transformers 4.55.4 triton 3.3.1 typer 0.16.1 typing_extensions 4.14.1 typing-inspection 0.4.1 urllib3 2.5.0 uvicorn 0.35.0 uvloop 0.21.0 vllm 0.10.1.1 watchfiles 1.1.0 websockets 15.0.1 wheel 0.45.1 xformers 0.0.31 xgrammar 0.1.21 yarl 1.20.1
Thank you for the detailed response! We’ll try to reproduce the issue on our end and investigate further. We’ll get back to you as soon as we have any updates.
Appreciate your patience and for providing the logs and code snippet!
@mythzZzZ After some investigation, it looks like there might be an issue with your inference code. You can try using the code below for offline inference with vLLM — it should work more smoothly!
from transformers import AutoTokenizer
from PIL import Image
from vllm import LLM, SamplingParams
# Model configuration
MODEL_NAME = "your_model_path"
# Option to use HuggingFace model ID or local model path
# MODEL_NAME = "openbmb/MiniCPM-V-4"
# Load image
image = Image.open("./assets/airplane.jpeg").convert("RGB")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# Initialize LLM
llm = LLM(
model=MODEL_NAME,
max_model_len=4096,
trust_remote_code=True,
disable_mm_preprocessor_cache=True,
limit_mm_per_prompt={"image": 5}
)
# Build messages
messages = [{
"role": "user",
"content": "(<image>./</image>)\nPlease describe the content of this image"
}]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
# Single inference
inputs = {
"prompt": prompt,
"multi_modal_data": {
"image": image
# For multi-image inference, use list format:
# "image": [image1, image2]
},
}
# Set stop tokens
stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
# Sampling parameters
sampling_params = SamplingParams(
stop_token_ids=stop_token_ids,
temperature=0.7,
top_p=0.8,
max_tokens=4096
)
# Generate results
outputs = llm.generate(inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
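For multi-image prompts, the same script extends as the comment above hints. A rough sketch (image1 and image2 are hypothetical PIL images opened the same way as image; note that one image placeholder goes into the prompt per picture, and limit_mm_per_prompt above already allows up to 5 images):

# Rough multi-image sketch extending the script above (not a standalone file):
# image1 / image2 are hypothetical PIL images opened the same way as `image`.
messages = [{
    "role": "user",
    "content": "(<image>./</image>)\n(<image>./</image>)\nPlease compare these two images"
}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": [image1, image2]  # list form for multiple images
    },
}
outputs = llm.generate(inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)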
@ZMXJJ I used the script you provided, but encountered the same issue. My code:
from transformers import AutoTokenizer
from PIL import Image
from vllm import LLM, SamplingParams
# Model configuration
MODEL_NAME = "/home/zhangwenkang/code/mm/LLaMA-Factory/OpenBMB/MiniCPM-V-4"
# Option to use HuggingFace model ID or local model path
# MODEL_NAME = "openbmb/MiniCPM-V-4"
# Load image
image = Image.open("/home/zhangwenkang/code/mm/LLaMA-Factory/dataset/aqy/aqy-3-test-aqy/need_count-79/电视剧-8/黑屏-4/0825112039.png").convert("RGB")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
# Initialize LLM
llm = LLM(
model=MODEL_NAME,
max_model_len=4096,
trust_remote_code=True,
disable_mm_preprocessor_cache=True,
limit_mm_per_prompt={"image": 5}
)
# Build messages
messages = [{
"role": "user",
"content": "(<image>./</image>)\n输出画质按钮的坐标"
}]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
# Single inference
inputs = {
"prompt": prompt,
"multi_modal_data": {
"image": image
# For multi-image inference, use list format:
# "image": [image1, image2]
},
}
# Set stop tokens
stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
# Sampling parameters
sampling_params = SamplingParams(
stop_token_ids=stop_token_ids,
temperature=0.7,
top_p=0.8,
max_tokens=4096
)
# Generate results
outputs = llm.generate(inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
My log:
INFO 08-27 17:08:18 [__init__.py:241] Automatically detected platform cuda.
INFO 08-27 17:08:19 [utils.py:326] non-default args: {'model': '/home/zhangwenkang/code/mm/LLaMA-Factory/OpenBMB/MiniCPM-V-4', 'trust_remote_code': True, 'max_model_len': 4096, 'disable_log_stats': True, 'limit_mm_per_prompt': {'image': 5}, 'disable_mm_preprocessor_cache': True}
WARNING 08-27 17:08:19 [arg_utils.py:888] --disable-mm-preprocessor-cache is deprecated and will be removed in v0.13. Please use --mm-processor-cache-gb 0 instead.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
INFO 08-27 17:08:22 [__init__.py:711] Resolved architecture: MiniCPMV
INFO 08-27 17:08:22 [__init__.py:1750] Using max model len 4096
INFO 08-27 17:08:22 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_0 pid=1844145) INFO 08-27 17:08:23 [core.py:636] Waiting for init message from front-end.
(EngineCore_0 pid=1844145) INFO 08-27 17:08:23 [core.py:74] Initializing a V1 LLM engine (v0.10.1.1) with config: model='/home/zhangwenkang/code/mm/LLaMA-Factory/OpenBMB/MiniCPM-V-4', speculative_config=None, tokenizer='/home/zhangwenkang/code/mm/LLaMA-Factory/OpenBMB/MiniCPM-V-4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/zhangwenkang/code/mm/LLaMA-Factory/OpenBMB/MiniCPM-V-4, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_0 pid=1844145) INFO 08-27 17:08:23 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=1844145) WARNING 08-27 17:08:23 [topk_topp_sampler.py:61] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_0 pid=1844145) /home/zhangwenkang/anaconda3/envs/minicpm/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:640: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
(EngineCore_0 pid=1844145) warnings.warn(
(EngineCore_0 pid=1844145) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(EngineCore_0 pid=1844145) INFO 08-27 17:08:25 [gpu_model_runner.py:1953] Starting to load model /home/zhangwenkang/code/mm/LLaMA-Factory/OpenBMB/MiniCPM-V-4...
(EngineCore_0 pid=1844145) INFO 08-27 17:08:25 [gpu_model_runner.py:1985] Loading model from scratch...
(EngineCore_0 pid=1844145) INFO 08-27 17:08:25 [cuda.py:328] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=1844145) INFO 08-27 17:08:25 [cuda.py:345] Using FlexAttention backend for head_size=72 on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 2.27it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.64it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.71it/s]
(EngineCore_0 pid=1844145)
(EngineCore_0 pid=1844145) INFO 08-27 17:08:26 [default_loader.py:262] Loading weights took 1.26 seconds
(EngineCore_0 pid=1844145) INFO 08-27 17:08:26 [gpu_model_runner.py:2007] Model loading took 7.6119 GiB and 1.409724 seconds
(EngineCore_0 pid=1844145) INFO 08-27 17:08:26 [gpu_model_runner.py:2591] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 7 video items of the maximum feature size.
(EngineCore_0 pid=1844145) INFO 08-27 17:08:31 [backends.py:548] Using cache directory: /home/zhangwenkang/.cache/vllm/torch_compile_cache/f29b1509c4/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=1844145) INFO 08-27 17:08:31 [backends.py:559] Dynamo bytecode transform time: 3.11 s
(EngineCore_0 pid=1844145) INFO 08-27 17:08:34 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 2.317 s
(EngineCore_0 pid=1844145) INFO 08-27 17:08:34 [monitor.py:34] torch.compile takes 3.11 s in total
(EngineCore_0 pid=1844145) INFO 08-27 17:08:35 [gpu_worker.py:276] Available KV cache memory: 9.76 GiB
(EngineCore_0 pid=1844145) INFO 08-27 17:08:35 [kv_cache_utils.py:849] GPU KV cache size: 319,776 tokens
(EngineCore_0 pid=1844145) INFO 08-27 17:08:35 [kv_cache_utils.py:853] Maximum concurrency for 4,096 tokens per request: 78.07x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████| 67/67 [00:01<00:00, 34.86it/s]
(EngineCore_0 pid=1844145) INFO 08-27 17:08:37 [gpu_model_runner.py:2708] Graph capturing finished in 2 secs, took 1.91 GiB
(EngineCore_0 pid=1844145) INFO 08-27 17:08:37 [core.py:214] init engine (profile, create kv cache, warmup model) took 10.93 seconds
INFO 08-27 17:08:38 [llm.py:298] Supported_tasks: ['generate']
Adding requests: 0%| | 0/1 [00:00<?, ?it/s]/home/zhangwenkang/anaconda3/envs/minicpm/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:640: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use slow_image_processor_class, or fast_image_processor_class instead
warnings.warn(
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.61it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.36it/s, est. speed input: 2076.48 toks/s, output: 63.84 toks/s]
509 655 581 737
ERROR 08-27 17:08:39 [core_client.py:562] Engine core proc EngineCore_0 died unexpectedly, shutting down client.
@ZMXJJ In the log, the results have already been produced successfully: 509 655 581 737. The error only appears after the output is printed: ERROR: Engine core proc EngineCore_0 died unexpectedly, shutting down client.
That's a strange issue 🤔, and it does seem possible that it could be related to the CUDA version. You might want to try upgrading your CUDA version and see if that resolves the problem.
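Since the crash only shows up after the output has been printed, another thing that may be worth ruling out (just a sketch of a possible workaround, not a confirmed fix) is releasing the engine explicitly before the script exits, so the engine's background process isn't torn down in an unlucky order at interpreter shutdown:

# Possible workaround sketch (unverified): explicitly release the engine and
# cached GPU memory at the end of the offline script, before the interpreter exits.
import gc
import torch

# ... llm.generate(...) and print(...) as in the script above ...

del llm                   # drop the vLLM engine reference
gc.collect()              # let Python clean it up deterministically
torch.cuda.empty_cache()  # release cached GPU memory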