[Bugfix][Model] Fix Phi3Small model to only support V0
When I run this model with `python3 -m vllm.entrypoints.cli.main serve microsoft/Phi-3-small-8k-instruct --trust-remote-code --gpu-memory-utilization 0.95`, I get the error below.
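For reference, a roughly equivalent offline reproduction (a sketch, not from the original report; the constructor arguments simply mirror the CLI flags above rather than the exact API-server code path):

```python
# Rough offline reproduction of the failing configuration; the arguments
# mirror the CLI flags rather than the API-server code path.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-small-8k-instruct",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```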
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
Why does it only support V0? Can you update the PR description?
According to the error, it is because FlashAttention is not supported for this model.
Can you show the logs?
If it is only about an unsupported head size, then the model can still support V1 once we implement an attention backend for it.
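For reference, a quick way to check the head size an attention backend would need to support for this model (a side check only; it assumes the remote config exposes the standard hidden_size and num_attention_heads fields):

```python
# Print the attention head size the backend must support.
# Assumes the standard HF config field names for this architecture.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "microsoft/Phi-3-small-8k-instruct", trust_remote_code=True)
print(cfg.hidden_size // cfg.num_attention_heads)
```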
DEBUG 04-11 23:28:36 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-11 23:28:36 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-11 23:28:36 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-11 23:28:36 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-11 23:28:36 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-11 23:28:36 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-11 23:28:36 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-11 23:28:36 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-11 23:28:36 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-11 23:28:36 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-11 23:28:36 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-11 23:28:36 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-11 23:28:36 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-11 23:28:36 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-11 23:28:36 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-11 23:28:36 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-11 23:28:36 [__init__.py:239] Automatically detected platform cuda.
DEBUG 04-11 23:28:39 [utils.py:135] Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'
DEBUG 04-11 23:28:39 [__init__.py:28] No plugins for group vllm.general_plugins found.
INFO 04-11 23:28:39 [api_server.py:1034] vLLM API server version 0.8.3rc2.dev139+gf8f9c0ba6.d20250411
INFO 04-11 23:28:39 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='microsoft/Phi-3-small-8k-instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='microsoft/Phi-3-small-8k-instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f285c7374c0>)
INFO 04-11 23:28:54 [config.py:676] This model supports multiple tasks: {'embed', 'classify', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
DEBUG 04-11 23:28:54 [arg_utils.py:1711] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
DEBUG 04-11 23:28:54 [arg_utils.py:1718] Setting max_num_seqs to 256 for OPENAI_API_SERVER usage context.
INFO 04-11 23:28:54 [config.py:1885] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 04-11 23:29:03 [tokenizer.py:248] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
DEBUG 04-11 23:29:13 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-11 23:29:13 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-11 23:29:13 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-11 23:29:13 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-11 23:29:13 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-11 23:29:13 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-11 23:29:13 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-11 23:29:13 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-11 23:29:13 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-11 23:29:13 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-11 23:29:13 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-11 23:29:13 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-11 23:29:13 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-11 23:29:13 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-11 23:29:13 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-11 23:29:13 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-11 23:29:13 [__init__.py:239] Automatically detected platform cuda.
INFO 04-11 23:29:23 [core.py:62] Initializing a V1 LLM engine (v0.8.3rc2.dev139+gf8f9c0ba6.d20250411) with config: model='microsoft/Phi-3-small-8k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-small-8k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=microsoft/Phi-3-small-8k-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
DEBUG 04-11 23:29:23 [__init__.py:28] No plugins for group vllm.general_plugins found.
DEBUG 04-11 23:29:24 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
WARNING 04-11 23:29:24 [utils.py:2430] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f974d8c6630>
DEBUG 04-11 23:29:24 [config.py:3884] enabled custom ops: Counter()
DEBUG 04-11 23:29:24 [config.py:3886] disabled custom ops: Counter()
DEBUG 04-11 23:29:25 [parallel_state.py:820] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.233.99.147:44459 backend=nccl
INFO 04-11 23:29:25 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-11 23:29:25 [cuda.py:221] Using Flash Attention backend on V1 engine.
DEBUG 04-11 23:29:25 [config.py:3884] enabled custom ops: Counter()
DEBUG 04-11 23:29:25 [config.py:3886] disabled custom ops: Counter()
INFO 04-11 23:29:25 [gpu_model_runner.py:1280] Starting to load model microsoft/Phi-3-small-8k-instruct...
INFO 04-11 23:29:25 [selector.py:119] Using BlocksparseFlashAttention backend.
INFO 04-11 23:29:26 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
DEBUG 04-11 23:29:26 [config.py:3884] enabled custom ops: Counter()
DEBUG 04-11 23:29:26 [config.py:3886] disabled custom ops: Counter({'rotary_embedding': 1})
WARNING 04-11 23:29:26 [config.py:3896] `torch.compile` is turned on, but the model microsoft/Phi-3-small-8k-instruct does not support it. Please open an issue on GitHub if you want it to be supported.
INFO 04-11 23:29:27 [weight_utils.py:265] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.16it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.09it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.07it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.60it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.37it/s]
INFO 04-11 23:29:30 [loader.py:458] Loading weights took 2.94 seconds
INFO 04-11 23:29:31 [gpu_model_runner.py:1295] Model loading took 13.7729 GiB and 5.529012 seconds
ERROR 04-11 23:29:31 [core.py:388] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/engine/core.py", line 379, in run_engine_core
ERROR 04-11 23:29:31 [core.py:388] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/engine/core.py", line 321, in __init__
ERROR 04-11 23:29:31 [core.py:388] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/engine/core.py", line 72, in __init__
ERROR 04-11 23:29:31 [core.py:388] self._initialize_kv_caches(vllm_config)
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/engine/core.py", line 134, in _initialize_kv_caches
ERROR 04-11 23:29:31 [core.py:388] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 04-11 23:29:31 [core.py:388] output = self.collective_rpc("determine_available_memory")
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-11 23:29:31 [core.py:388] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/utils.py", line 2364, in run_method
ERROR 04-11 23:29:31 [core.py:388] return func(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-11 23:29:31 [core.py:388] return func(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/worker/gpu_worker.py", line 157, in determine_available_memory
ERROR 04-11 23:29:31 [core.py:388] self.model_runner.profile_run()
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1595, in profile_run
ERROR 04-11 23:29:31 [core.py:388] hidden_states = self._dummy_run(self.max_num_tokens)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-11 23:29:31 [core.py:388] return func(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1445, in _dummy_run
ERROR 04-11 23:29:31 [core.py:388] hidden_states = model(
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-11 23:29:31 [core.py:388] return self._call_impl(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-11 23:29:31 [core.py:388] return forward_call(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/model_executor/models/phi3_small.py", line 430, in forward
ERROR 04-11 23:29:31 [core.py:388] output_hidden_states = self.model(
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-11 23:29:31 [core.py:388] return self._call_impl(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-11 23:29:31 [core.py:388] return forward_call(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/model_executor/models/phi3_small.py", line 350, in forward
ERROR 04-11 23:29:31 [core.py:388] hidden_states = layer(positions, hidden_states)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-11 23:29:31 [core.py:388] return self._call_impl(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-11 23:29:31 [core.py:388] return forward_call(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/model_executor/models/phi3_small.py", line 287, in forward
ERROR 04-11 23:29:31 [core.py:388] hidden_states = self.self_attn(
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-11 23:29:31 [core.py:388] return self._call_impl(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-11 23:29:31 [core.py:388] return forward_call(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/model_executor/models/phi3_small.py", line 249, in forward
ERROR 04-11 23:29:31 [core.py:388] attn_output = self.attn(q, k, v)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-11 23:29:31 [core.py:388] return self._call_impl(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-11 23:29:31 [core.py:388] return forward_call(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/attention/layer.py", line 229, in forward
ERROR 04-11 23:29:31 [core.py:388] return torch.ops.vllm.unified_attention(
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1123, in __call__
ERROR 04-11 23:29:31 [core.py:388] return self._op(*args, **(kwargs or {}))
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/attention/layer.py", line 342, in unified_attention
ERROR 04-11 23:29:31 [core.py:388] return self.impl.forward(self, query, key, value, kv_cache, attn_metadata)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/attention/backends/blocksparse_attn.py", line 412, in forward
ERROR 04-11 23:29:31 [core.py:388] if prefill_meta := attn_metadata.prefill_metadata:
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] AttributeError: 'NoneType' object has no attribute 'prefill_metadata'
ERROR 04-11 23:29:31 [core.py:388]
CRITICAL 04-11 23:29:31 [core_client.py:360] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
cc @LucasWilkinson
I'm getting:
(vllm) lwilkinson@beaker:~/code/vllm$ python3 -m vllm.entrypoints.cli.main serve microsoft/Phi-3-small-8k-instruct --trust-remote-code --gpu-memory-utilization 0.95
INFO 04-14 20:31:14 [__init__.py:239] Automatically detected platform cuda.
INFO 04-14 20:31:15 [api_server.py:1034] vLLM API server version 0.8.5.dev9+g7b5ecf79b
INFO 04-14 20:31:15 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='microsoft/Phi-3-small-8k-instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='microsoft/Phi-3-small-8k-instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, enable_chunked_prefill=None, multi_step_stream_outputs=True, scheduling_policy='fcfs', disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7eac6c785fc0>)
INFO 04-14 20:31:30 [config.py:689] This model supports multiple tasks: {'classify', 'reward', 'embed', 'score', 'generate'}. Defaulting to 'generate'.
INFO 04-14 20:31:30 [config.py:1948] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-14 20:31:32 [tokenizer.py:248] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/lwilkinson/code/vllm/vllm/entrypoints/cli/main.py", line 57, in <module>
main()
File "/home/lwilkinson/code/vllm/vllm/entrypoints/cli/main.py", line 51, in main
args.dispatch_function(args)
File "/home/lwilkinson/code/vllm/vllm/entrypoints/cli/serve.py", line 27, in cmd
uvloop.run(run_server(args))
File "/home/lwilkinson/.venvs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/home/lwilkinson/.venvs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/home/lwilkinson/code/vllm/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/home/lwilkinson/code/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/home/lwilkinson/code/vllm/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
File "/home/lwilkinson/code/vllm/vllm/v1/engine/async_llm.py", line 136, in from_vllm_config
return cls(
File "/home/lwilkinson/code/vllm/vllm/v1/engine/async_llm.py", line 102, in __init__
self.engine_core = EngineCoreClient.make_client(
File "/home/lwilkinson/code/vllm/vllm/v1/engine/core_client.py", line 71, in make_client
return AsyncMPClient(vllm_config, executor_class, log_stats)
File "/home/lwilkinson/code/vllm/vllm/v1/engine/core_client.py", line 604, in __init__
super().__init__(
File "/home/lwilkinson/code/vllm/vllm/v1/engine/core_client.py", line 400, in __init__
self._init_core_engines(vllm_config, new_core_engine,
File "/home/lwilkinson/code/vllm/vllm/v1/engine/core_client.py", line 448, in _init_core_engines
core_engine = new_core_engine(
File "/home/lwilkinson/code/vllm/vllm/v1/engine/core_client.py", line 395, in <lambda>
new_core_engine = lambda index, local_dp_rank=None: CoreEngine(
File "/home/lwilkinson/code/vllm/vllm/v1/engine/core_client.py", line 275, in __init__
self.proc_handle = BackgroundProcHandle(
File "/home/lwilkinson/code/vllm/vllm/v1/utils.py", line 120, in __init__
self.proc.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "/home/lwilkinson/code/vllm/vllm/transformers_utils/config.py", line 615, in _reduce_config
return (pickle.loads, (cloudpickle.dumps(config), ))
File "/home/lwilkinson/.venvs/vllm/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1537, in dumps
cp.dump(obj)
File "/home/lwilkinson/.venvs/vllm/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1303, in dump
return super().dump(obj)
TypeError: cannot pickle '_thread.RLock' object
Any tips on reproducing this?
I see your version is 0.8.5; I will test again with that version.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lengrongfu.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
😭 I can still reproduce this problem on version 0.8.4, and I don't know the reason. Can you help verify it? @DarkLight1337
Result of `python3 collect_env.py`:
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.35
Python version: 3.12.10 (main, Apr 9 2025, 08:55:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-134-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A800 80GB PCIe
Nvidia driver version: 550.127.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 45 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
CPU family: 6
Model: 106
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
Stepping: 6
BogoMIPS: 5199.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities
L1d cache: 1.5 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 40 MiB (32 instances)
L3 cache: 48 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerability Gather data sampling: Vulnerable: No microcode
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flashinfer-python==0.2.1.post2+cu124torch2.6
[pip3] numpy==2.2.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.4.0
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-31 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NVIDIA_VISIBLE_DEVICES=GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
NCCL_VERSION=2.20.5-1
CUDA_DEVICE_SM_LIMIT=0
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.4.0
CUDA_OVERSUBSCRIBE=true
CUDA_DEVICE_MEMORY_LIMIT_0=40000m
CUDA_DEVICE_MEMORY_SHARED_CACHE=/usr/local/vgpu/426425bb-d0a1-4e69-9f1d-db7036c7eb31.cache
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
cc @LucasWilkinson
I've also encountered the same problem when running Phi-3-small-8k-instruct. Have you fixed it, or is there a previous version that can run this model?
The issue is similar to https://github.com/vllm-project/vllm/issues/15973: the crash happens in the profiling run, which does not set attn_metadata. You can see it in the stack trace:
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1595, in profile_run
ERROR 04-11 23:29:31 [core.py:388] hidden_states = self._dummy_run(self.max_num_tokens)
In the V1 attention backends there is this guard for the profiling run: https://github.com/vllm-project/vllm/blob/b78f844a6743732b58022f2f84858d61b40b5913/vllm/v1/attention/backends/flash_attn.py#L557-L559
The issue is that you have vllm/attention/backends/blocksparse_attn.py (V0 only), and there is no such thing as vllm/v1/attention/backends/blocksparse_attn.py. The blocksparse backend seems to be required for Phi3Small due to its alternating block-sparse layers.
https://github.com/vllm-project/vllm/blob/b78f844a6743732b58022f2f84858d61b40b5913/vllm/model_executor/models/phi3_small.py#L221-L228
https://github.com/vllm-project/vllm/blob/b78f844a6743732b58022f2f84858d61b40b5913/vllm/attention/layer.py#L126-L132
https://github.com/vllm-project/vllm/blob/b78f844a6743732b58022f2f84858d61b40b5913/vllm/attention/selector.py#L118-L122
I opened an issue for this: https://github.com/vllm-project/vllm/issues/18815.
You could also try slapping

```python
if attn_metadata is None:
    # Profiling run.
    return output
```

into the V0 vllm/attention/backends/blocksparse_attn.py and see if that works (maybe not; the differences between the V0 and V1 backends look substantial).
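If editing the file is inconvenient, below is a rough, untested monkey-patch sketch of the same idea. The impl class name (BlocksparseFlashAttentionImpl) and the query-shaped placeholder return are assumptions, and since the V1 engine core runs the model in a separate worker process, the patch only takes effect in the process that actually runs the model.

```python
# Untested sketch: wrap the V0 blocksparse forward so a profiling run
# (attn_metadata is None) returns a placeholder instead of crashing.
# Assumes the impl class is named BlocksparseFlashAttentionImpl and that
# a tensor shaped like `query` is an acceptable dummy-run output.
import torch
from vllm.attention.backends import blocksparse_attn

_orig_forward = blocksparse_attn.BlocksparseFlashAttentionImpl.forward

def _patched_forward(self, layer, query, key, value, kv_cache,
                     attn_metadata, *args, **kwargs):
    if attn_metadata is None:
        # Profiling run, mirroring the early return in the V1 backends.
        return torch.empty_like(query)
    return _orig_forward(self, layer, query, key, value, kv_cache,
                         attn_metadata, *args, **kwargs)

blocksparse_attn.BlocksparseFlashAttentionImpl.forward = _patched_forward
```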
I ran into a similar issue and managed to partially solve it by using engine V0: https://github.com/vllm-project/vllm/issues/19665#issuecomment-2980925646
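If you just need the model running until a V1 blocksparse backend exists, here is a sketch of forcing the V0 engine from Python (VLLM_USE_V1 must be set before vLLM is imported; for the server command, prefixing it with VLLM_USE_V1=0 should be equivalent):

```python
# Sketch: force the V0 engine, which the linked comment reports as a
# partial workaround. The env var must be set before importing vLLM.
import os
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3-small-8k-instruct",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```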
Does the unmerged branch solve the Phi-3 issues?