[Bugfix][Model] Fix Phi3Small model to only support V0
When I run this model with `python3 -m vllm.entrypoints.cli.main serve microsoft/Phi-3-small-8k-instruct --trust-remote-code --gpu-memory-utilization 0.95`, I get the error below.
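For reference, a roughly equivalent offline reproduction (a sketch, not from the original report; the constructor arguments simply mirror the CLI flags above rather than the exact API-server code path):

```python
# Rough offline reproduction of the failing configuration; the arguments
# mirror the CLI flags rather than the API-server code path.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-small-8k-instruct",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```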
👋 Hi! Thank you for contributing to the vLLM project.
💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
🚀
Why does it only support V0? Can you update the PR description?
According to the error, it is because FlashAttention is not supported for this model.
Can you show the logs?
If it is only about an unsupported head size, then the model can still support V1 once we implement an attention backend for it.
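For reference, a quick way to check the head size an attention backend would need to support for this model (a side check only; it assumes the remote config exposes the standard hidden_size and num_attention_heads fields):

```python
# Print the attention head size the backend must support.
# Assumes the standard HF config field names for this architecture.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "microsoft/Phi-3-small-8k-instruct", trust_remote_code=True)
print(cfg.hidden_size // cfg.num_attention_heads)
```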
DEBUG 04-11 23:28:36 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-11 23:28:36 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-11 23:28:36 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-11 23:28:36 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-11 23:28:36 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-11 23:28:36 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-11 23:28:36 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-11 23:28:36 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-11 23:28:36 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-11 23:28:36 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-11 23:28:36 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-11 23:28:36 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-11 23:28:36 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-11 23:28:36 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-11 23:28:36 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-11 23:28:36 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-11 23:28:36 [__init__.py:239] Automatically detected platform cuda.
DEBUG 04-11 23:28:39 [utils.py:135] Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'
DEBUG 04-11 23:28:39 [__init__.py:28] No plugins for group vllm.general_plugins found.
INFO 04-11 23:28:39 [api_server.py:1034] vLLM API server version 0.8.3rc2.dev139+gf8f9c0ba6.d20250411
INFO 04-11 23:28:39 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='microsoft/Phi-3-small-8k-instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='microsoft/Phi-3-small-8k-instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f285c7374c0>)
INFO 04-11 23:28:54 [config.py:676] This model supports multiple tasks: {'embed', 'classify', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
DEBUG 04-11 23:28:54 [arg_utils.py:1711] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
DEBUG 04-11 23:28:54 [arg_utils.py:1718] Setting max_num_seqs to 256 for OPENAI_API_SERVER usage context.
INFO 04-11 23:28:54 [config.py:1885] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 04-11 23:29:03 [tokenizer.py:248] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
DEBUG 04-11 23:29:13 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-11 23:29:13 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-11 23:29:13 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-11 23:29:13 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-11 23:29:13 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-11 23:29:13 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-11 23:29:13 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-11 23:29:13 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-11 23:29:13 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-11 23:29:13 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-11 23:29:13 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-11 23:29:13 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-11 23:29:13 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-11 23:29:13 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-11 23:29:13 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-11 23:29:13 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-11 23:29:13 [__init__.py:239] Automatically detected platform cuda.
INFO 04-11 23:29:23 [core.py:62] Initializing a V1 LLM engine (v0.8.3rc2.dev139+gf8f9c0ba6.d20250411) with config: model='microsoft/Phi-3-small-8k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-small-8k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=microsoft/Phi-3-small-8k-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
DEBUG 04-11 23:29:23 [__init__.py:28] No plugins for group vllm.general_plugins found.
DEBUG 04-11 23:29:24 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
WARNING 04-11 23:29:24 [utils.py:2430] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f974d8c6630>
DEBUG 04-11 23:29:24 [config.py:3884] enabled custom ops: Counter()
DEBUG 04-11 23:29:24 [config.py:3886] disabled custom ops: Counter()
DEBUG 04-11 23:29:25 [parallel_state.py:820] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.233.99.147:44459 backend=nccl
INFO 04-11 23:29:25 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-11 23:29:25 [cuda.py:221] Using Flash Attention backend on V1 engine.
DEBUG 04-11 23:29:25 [config.py:3884] enabled custom ops: Counter()
DEBUG 04-11 23:29:25 [config.py:3886] disabled custom ops: Counter()
INFO 04-11 23:29:25 [gpu_model_runner.py:1280] Starting to load model microsoft/Phi-3-small-8k-instruct...
INFO 04-11 23:29:25 [selector.py:119] Using BlocksparseFlashAttention backend.
INFO 04-11 23:29:26 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
DEBUG 04-11 23:29:26 [config.py:3884] enabled custom ops: Counter()
DEBUG 04-11 23:29:26 [config.py:3886] disabled custom ops: Counter({'rotary_embedding': 1})
WARNING 04-11 23:29:26 [config.py:3896] `torch.compile` is turned on, but the model microsoft/Phi-3-small-8k-instruct does not support it. Please open an issue on GitHub if you want it to be supported.
INFO 04-11 23:29:27 [weight_utils.py:265] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:02, 1.16it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.09it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.07it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.60it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.37it/s]
INFO 04-11 23:29:30 [loader.py:458] Loading weights took 2.94 seconds
INFO 04-11 23:29:31 [gpu_model_runner.py:1295] Model loading took 13.7729 GiB and 5.529012 seconds
ERROR 04-11 23:29:31 [core.py:388] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/engine/core.py", line 379, in run_engine_core
ERROR 04-11 23:29:31 [core.py:388] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/engine/core.py", line 321, in __init__
ERROR 04-11 23:29:31 [core.py:388] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/engine/core.py", line 72, in __init__
ERROR 04-11 23:29:31 [core.py:388] self._initialize_kv_caches(vllm_config)
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/engine/core.py", line 134, in _initialize_kv_caches
ERROR 04-11 23:29:31 [core.py:388] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 04-11 23:29:31 [core.py:388] output = self.collective_rpc("determine_available_memory")
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-11 23:29:31 [core.py:388] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/utils.py", line 2364, in run_method
ERROR 04-11 23:29:31 [core.py:388] return func(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-11 23:29:31 [core.py:388] return func(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/worker/gpu_worker.py", line 157, in determine_available_memory
ERROR 04-11 23:29:31 [core.py:388] self.model_runner.profile_run()
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1595, in profile_run
ERROR 04-11 23:29:31 [core.py:388] hidden_states = self._dummy_run(self.max_num_tokens)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-11 23:29:31 [core.py:388] return func(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1445, in _dummy_run
ERROR 04-11 23:29:31 [core.py:388] hidden_states = model(
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-11 23:29:31 [core.py:388] return self._call_impl(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-11 23:29:31 [core.py:388] return forward_call(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/model_executor/models/phi3_small.py", line 430, in forward
ERROR 04-11 23:29:31 [core.py:388] output_hidden_states = self.model(
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-11 23:29:31 [core.py:388] return self._call_impl(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-11 23:29:31 [core.py:388] return forward_call(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/model_executor/models/phi3_small.py", line 350, in forward
ERROR 04-11 23:29:31 [core.py:388] hidden_states = layer(positions, hidden_states)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-11 23:29:31 [core.py:388] return self._call_impl(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-11 23:29:31 [core.py:388] return forward_call(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/model_executor/models/phi3_small.py", line 287, in forward
ERROR 04-11 23:29:31 [core.py:388] hidden_states = self.self_attn(
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-11 23:29:31 [core.py:388] return self._call_impl(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-11 23:29:31 [core.py:388] return forward_call(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/model_executor/models/phi3_small.py", line 249, in forward
ERROR 04-11 23:29:31 [core.py:388] attn_output = self.attn(q, k, v)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-11 23:29:31 [core.py:388] return self._call_impl(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-11 23:29:31 [core.py:388] return forward_call(*args, **kwargs)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/attention/layer.py", line 229, in forward
ERROR 04-11 23:29:31 [core.py:388] return torch.ops.vllm.unified_attention(
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1123, in __call__
ERROR 04-11 23:29:31 [core.py:388] return self._op(*args, **(kwargs or {}))
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/attention/layer.py", line 342, in unified_attention
ERROR 04-11 23:29:31 [core.py:388] return self.impl.forward(self, query, key, value, kv_cache, attn_metadata)
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/attention/backends/blocksparse_attn.py", line 412, in forward
ERROR 04-11 23:29:31 [core.py:388] if prefill_meta := attn_metadata.prefill_metadata:
ERROR 04-11 23:29:31 [core.py:388] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-11 23:29:31 [core.py:388] AttributeError: 'NoneType' object has no attribute 'prefill_metadata'
ERROR 04-11 23:29:31 [core.py:388]
CRITICAL 04-11 23:29:31 [core_client.py:360] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
cc @LucasWilkinson
I'm getting:
(vllm) lwilkinson@beaker:~/code/vllm$ python3 -m vllm.entrypoints.cli.main serve microsoft/Phi-3-small-8k-instruct --trust-remote-code --gpu-memory-utilization 0.95
INFO 04-14 20:31:14 [__init__.py:239] Automatically detected platform cuda.
INFO 04-14 20:31:15 [api_server.py:1034] vLLM API server version 0.8.5.dev9+g7b5ecf79b
INFO 04-14 20:31:15 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='microsoft/Phi-3-small-8k-instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='microsoft/Phi-3-small-8k-instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, enable_chunked_prefill=None, multi_step_stream_outputs=True, scheduling_policy='fcfs', disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7eac6c785fc0>)
INFO 04-14 20:31:30 [config.py:689] This model supports multiple tasks: {'classify', 'reward', 'embed', 'score', 'generate'}. Defaulting to 'generate'.
INFO 04-14 20:31:30 [config.py:1948] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-14 20:31:32 [tokenizer.py:248] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/lwilkinson/code/vllm/vllm/entrypoints/cli/main.py", line 57, in <module>
main()
File "/home/lwilkinson/code/vllm/vllm/entrypoints/cli/main.py", line 51, in main
args.dispatch_function(args)
File "/home/lwilkinson/code/vllm/vllm/entrypoints/cli/serve.py", line 27, in cmd
uvloop.run(run_server(args))
File "/home/lwilkinson/.venvs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/home/lwilkinson/.venvs/vllm/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/home/lwilkinson/code/vllm/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/home/lwilkinson/code/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
return await anext(self.gen)
File "/home/lwilkinson/code/vllm/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
File "/home/lwilkinson/code/vllm/vllm/v1/engine/async_llm.py", line 136, in from_vllm_config
return cls(
File "/home/lwilkinson/code/vllm/vllm/v1/engine/async_llm.py", line 102, in __init__
self.engine_core = EngineCoreClient.make_client(
File "/home/lwilkinson/code/vllm/vllm/v1/engine/core_client.py", line 71, in make_client
return AsyncMPClient(vllm_config, executor_class, log_stats)
File "/home/lwilkinson/code/vllm/vllm/v1/engine/core_client.py", line 604, in __init__
super().__init__(
File "/home/lwilkinson/code/vllm/vllm/v1/engine/core_client.py", line 400, in __init__
self._init_core_engines(vllm_config, new_core_engine,
File "/home/lwilkinson/code/vllm/vllm/v1/engine/core_client.py", line 448, in _init_core_engines
core_engine = new_core_engine(
File "/home/lwilkinson/code/vllm/vllm/v1/engine/core_client.py", line 395, in <lambda>
new_core_engine = lambda index, local_dp_rank=None: CoreEngine(
File "/home/lwilkinson/code/vllm/vllm/v1/engine/core_client.py", line 275, in __init__
self.proc_handle = BackgroundProcHandle(
File "/home/lwilkinson/code/vllm/vllm/v1/utils.py", line 120, in __init__
self.proc.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "/home/lwilkinson/code/vllm/vllm/transformers_utils/config.py", line 615, in _reduce_config
return (pickle.loads, (cloudpickle.dumps(config), ))
File "/home/lwilkinson/.venvs/vllm/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1537, in dumps
cp.dump(obj)
File "/home/lwilkinson/.venvs/vllm/lib/python3.10/site-packages/cloudpickle/cloudpickle.py", line 1303, in dump
return super().dump(obj)
TypeError: cannot pickle '_thread.RLock' object
Any tips on reproducing this?
I see your version is 0.8.5; I will test again with that version.
This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @lengrongfu.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
😭 I can still reproduce this problem on version 0.8.4, and I don't know the reason. Can you help verify it? @DarkLight1337
Result of `python3 collect_env.py`:
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.35
Python version: 3.12.10 (main, Apr 9 2025, 08:55:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-134-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A800 80GB PCIe
Nvidia driver version: 550.127.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 45 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
CPU family: 6
Model: 106
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
Stepping: 6
BogoMIPS: 5199.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities
L1d cache: 1.5 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 40 MiB (32 instances)
L3 cache: 48 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerability Gather data sampling: Vulnerable: No microcode
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flashinfer-python==0.2.1.post2+cu124torch2.6
[pip3] numpy==2.2.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.4.0
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.4
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-31 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NVIDIA_VISIBLE_DEVICES=GPU-95aa552b-d1ea-39a2-62ae-e0e44fc85aaf
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
NCCL_VERSION=2.20.5-1
CUDA_DEVICE_SM_LIMIT=0
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.4.0
CUDA_OVERSUBSCRIBE=true
CUDA_DEVICE_MEMORY_LIMIT_0=40000m
CUDA_DEVICE_MEMORY_SHARED_CACHE=/usr/local/vgpu/426425bb-d0a1-4e69-9f1d-db7036c7eb31.cache
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
cc @LucasWilkinson
I've also encountered the same problem when running Phi-3-small-8k-instruct. Have you fixed it, or is there a previous version that can run this model?
The issue is similar to https://github.com/vllm-project/vllm/issues/15973: the crash happens in the profiling run, which does not set attn_metadata. You can see it in the stack trace:
ERROR 04-11 23:29:31 [core.py:388] File "/root/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1595, in profile_run
ERROR 04-11 23:29:31 [core.py:388] hidden_states = self._dummy_run(self.max_num_tokens)
In the V1 attention backends there is this guard for the profiling run: https://github.com/vllm-project/vllm/blob/b78f844a6743732b58022f2f84858d61b40b5913/vllm/v1/attention/backends/flash_attn.py#L557-L559
The issue is that you have vllm/attention/backends/blocksparse_attn.py (V0 only), and there is no such thing as vllm/v1/attention/backends/blocksparse_attn.py. The blocksparse backend seems to be required for Phi3Small due to its alternating block-sparse layers.
https://github.com/vllm-project/vllm/blob/b78f844a6743732b58022f2f84858d61b40b5913/vllm/model_executor/models/phi3_small.py#L221-L228
https://github.com/vllm-project/vllm/blob/b78f844a6743732b58022f2f84858d61b40b5913/vllm/attention/layer.py#L126-L132
https://github.com/vllm-project/vllm/blob/b78f844a6743732b58022f2f84858d61b40b5913/vllm/attention/selector.py#L118-L122
I opened an issue for this: https://github.com/vllm-project/vllm/issues/18815.
You could also try slapping

```python
if attn_metadata is None:
    # Profiling run.
    return output
```

into the V0 vllm/attention/backends/blocksparse_attn.py and see if that works (maybe not; the differences between the V0 and V1 backends look substantial).
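If editing the file is inconvenient, below is a rough, untested monkey-patch sketch of the same idea. The impl class name (BlocksparseFlashAttentionImpl) and the query-shaped placeholder return are assumptions, and since the V1 engine core runs the model in a separate worker process, the patch only takes effect in the process that actually runs the model.

```python
# Untested sketch: wrap the V0 blocksparse forward so a profiling run
# (attn_metadata is None) returns a placeholder instead of crashing.
# Assumes the impl class is named BlocksparseFlashAttentionImpl and that
# a tensor shaped like `query` is an acceptable dummy-run output.
import torch
from vllm.attention.backends import blocksparse_attn

_orig_forward = blocksparse_attn.BlocksparseFlashAttentionImpl.forward

def _patched_forward(self, layer, query, key, value, kv_cache,
                     attn_metadata, *args, **kwargs):
    if attn_metadata is None:
        # Profiling run, mirroring the early return in the V1 backends.
        return torch.empty_like(query)
    return _orig_forward(self, layer, query, key, value, kv_cache,
                         attn_metadata, *args, **kwargs)

blocksparse_attn.BlocksparseFlashAttentionImpl.forward = _patched_forward
```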
I ran into a similar issue and managed to partially solve it by using engine V0: https://github.com/vllm-project/vllm/issues/19665#issuecomment-2980925646
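If you just need the model running until a V1 blocksparse backend exists, here is a sketch of forcing the V0 engine from Python (VLLM_USE_V1 must be set before vLLM is imported; for the server command, prefixing it with VLLM_USE_V1=0 should be equivalent):

```python
# Sketch: force the V0 engine, which the linked comment reports as a
# partial workaround. The env var must be set before importing vLLM.
import os
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3-small-8k-instruct",
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```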
Does the unmerged branch solve the Phi-3 issues?