Error while serving fine-tuned Qwen 2.5 VL model
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
[2025-05-23 13:50:43,655] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 05-23 13:50:48 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-23 13:50:48 [__init__.py:239] Automatically detected platform cuda.
- `llamafactory` version: 0.9.3.dev0
- Platform: Linux-6.8.0-54-generic-x86_64-with-glibc2.39
- Python version: 3.9.21
- PyTorch version: 2.6.0+cu124 (GPU)
- Transformers version: 4.52.1
- Datasets version: 3.6.0
- Accelerate version: 1.7.0
- PEFT version: 0.15.2
- TRL version: 0.9.6
- GPU type: NVIDIA L40S
- GPU number: 2
- GPU memory: 44.40GB
- DeepSpeed version: 0.16.9
- vLLM version: 0.8.5.post1
- Git commit: a9211a730eb3fc7fe0d008107a0a135c3a8734d8
Reproduction
I fine-tuned Qwen 2.5 VL 3B Instruct. Then, I tried to deploy it as follows:
API_PORT=8000 llamafactory-cli api examples/inference/qwen2_5vl.yaml infer_backend=vllm vllm_enforce_eager=true
This gave me an error. I was able to serve the base model using the same command, but not the fine-tuned version.
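For reference, the same load should be reproducible without the LLaMA-Factory CLI by pointing vLLM's Python API at the checkpoint directly (a minimal sketch using the checkpoint path and settings that appear in the logs below; not part of the original run):

```python
# Minimal reproduction sketch: load the fine-tuned checkpoint with vLLM directly.
# Path and settings mirror the llamafactory-cli invocation above; adjust as needed.
from vllm import LLM

llm = LLM(
    model="saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again",
    trust_remote_code=True,
    enforce_eager=True,      # matches vllm_enforce_eager=true
    tensor_parallel_size=2,  # two GPUs, as in the System Info above
)
```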
Here is the error: [2025-05-23 13:44:13,050] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect) INFO 05-23 13:44:16 [importing.py:53] Triton module has been replaced with a placeholder. INFO 05-23 13:44:16 [init.py:239] Automatically detected platform cuda. [INFO|configuration_utils.py:696] 2025-05-23 13:44:20,227 >> loading configuration file saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again/config.json [INFO|configuration_utils.py:770] 2025-05-23 13:44:20,236 >> Model config Qwen2_5_VLConfig { "architectures": [ "Qwen2_5_VLForConditionalGeneration" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 2048, "image_token_id": 151655, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 128000, "max_window_layers": 70, "model_type": "qwen2_5_vl", "num_attention_heads": 16, "num_hidden_layers": 36, "num_key_value_heads": 2, "rms_norm_eps": 1e-06, "rope_scaling": { "mrope_section": [ 16, 24, 24 ], "rope_type": "default", "type": "default" }, "rope_theta": 1000000.0, "sliding_window": 32768, "text_config": { "architectures": [ "Qwen2_5_VLForConditionalGeneration" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 2048, "image_token_id": null, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 128000, "max_window_layers": 70, "model_type": "qwen2_5_vl_text", "num_attention_heads": 16, "num_hidden_layers": 36, "num_key_value_heads": 2, "rms_norm_eps": 1e-06, "rope_scaling": { "mrope_section": [ 16, 24, 24 ], "rope_type": "default", "type": "default" }, "rope_theta": 1000000.0, "sliding_window": 32768, "tie_word_embeddings": true, "torch_dtype": "bfloat16", "use_cache": true, "use_sliding_window": false, "video_token_id": null, "vision_end_token_id": 151653, "vision_start_token_id": 151652, "vision_token_id": 151654, "vocab_size": 151936 }, "torch_dtype": "bfloat16", "transformers_version": "4.52.1", "use_cache": true, "use_sliding_window": false, "video_token_id": 151656, "vision_config": { "depth": 32, "fullatt_block_indexes": [ 7, 15, 23, 31 ], "hidden_act": "silu", "hidden_size": 1280, "in_channels": 3, "in_chans": 3, "initializer_range": 0.02, "intermediate_size": 3420, "model_type": "qwen2_5_vl", "num_heads": 16, "out_hidden_size": 2048, "patch_size": 14, "spatial_merge_size": 2, "spatial_patch_size": 14, "temporal_patch_size": 2, "tokens_per_second": 2, "window_size": 112 }, "vision_end_token_id": 151653, "vision_start_token_id": 151652, "vision_token_id": 151654, "vocab_size": 151936 }
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,250 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,250 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,250 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,250 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,250 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,250 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,250 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2299] 2025-05-23 13:44:20,517 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|image_processing_base.py:378] 2025-05-23 13:44:20,518 >> loading configuration file saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again/preprocessor_config.json
[INFO|image_processing_base.py:378] 2025-05-23 13:44:20,521 >> loading configuration file saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again/preprocessor_config.json
[WARNING|logging.py:328] 2025-05-23 13:44:20,521 >> Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
[INFO|image_processing_base.py:433] 2025-05-23 13:44:20,530 >> Image processor Qwen2VLImageProcessor {
"do_convert_rgb": true,
"do_normalize": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.48145466,
0.4578275,
0.40821073
],
"image_processor_type": "Qwen2VLImageProcessor",
"image_std": [
0.26862954,
0.26130258,
0.27577711
],
"max_pixels": 12845056,
"merge_size": 2,
"min_pixels": 3136,
"patch_size": 14,
"processor_class": "Qwen2_5_VLProcessor",
"resample": 3,
"rescale_factor": 0.00392156862745098,
"size": {
"longest_edge": 12845056,
"shortest_edge": 3136
},
"temporal_patch_size": 2
}
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,531 >> loading file vocab.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,531 >> loading file merges.txt [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,531 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,531 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,531 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,531 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,531 >> loading file chat_template.jinja [INFO|tokenization_utils_base.py:2299] 2025-05-23 13:44:20,788 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [INFO|video_processing_utils.py:627] 2025-05-23 13:44:20,790 >> loading configuration file saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again/video_preprocessor_config.json [INFO|video_processing_utils.py:683] 2025-05-23 13:44:20,797 >> Video processor Qwen2VLVideoProcessor { "_valid_kwargs_names": [ "do_convert_rgb", "do_resize", "size", "size_divisor", "default_to_square", "resample", "do_rescale", "rescale_factor", "do_normalize", "image_mean", "image_std", "do_pad", "do_center_crop", "crop_size", "data_format", "input_data_format", "device", "min_pixels", "max_pixels", "patch_size", "temporal_patch_size", "merge_size" ], "crop_size": null, "data_format": "channels_first", "default_to_square": true, "device": null, "do_center_crop": null, "do_convert_rgb": true, "do_normalize": true, "do_pad": null, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "input_data_format": null, "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "model_valid_processing_keys": [ "do_convert_rgb", "do_resize", "size", "size_divisor", "default_to_square", "resample", "do_rescale", "rescale_factor", "do_normalize", "image_mean", "image_std", "do_pad", "do_center_crop", "crop_size", "data_format", "input_data_format", "device", "min_pixels", "max_pixels", "patch_size", "temporal_patch_size", "merge_size" ], "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "size_divisor": null, "temporal_patch_size": 2, "video_processor_type": "Qwen2VLVideoProcessor" }
[INFO|processing_utils.py:990] 2025-05-23 13:44:21,091 >> Processor Qwen2_5_VLProcessor:
image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 }
tokenizer: Qwen2TokenizerFast(name_or_path='saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("<tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("</tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } )
video_processor: Qwen2VLVideoProcessor { "_valid_kwargs_names": [ "do_convert_rgb", "do_resize", "size", "size_divisor", "default_to_square", "resample", "do_rescale", "rescale_factor", "do_normalize", "image_mean", "image_std", "do_pad", "do_center_crop", "crop_size", "data_format", "input_data_format", "device", "min_pixels", "max_pixels", "patch_size", "temporal_patch_size", "merge_size" ], "crop_size": null, "data_format": "channels_first", "default_to_square": true, "device": null, "do_center_crop": null, "do_convert_rgb": true, "do_normalize": true, "do_pad": null, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "input_data_format": null, "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "model_valid_processing_keys": [ "do_convert_rgb", "do_resize", "size", "size_divisor", "default_to_square", "resample", "do_rescale", "rescale_factor", "do_normalize", "image_mean", "image_std", "do_pad", "do_center_crop", "crop_size", "data_format", "input_data_format", "device", "min_pixels", "max_pixels", "patch_size", "temporal_patch_size", "merge_size" ], "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "size_divisor": null, "temporal_patch_size": 2, "video_processor_type": "Qwen2VLVideoProcessor" }
{ "processor_class": "Qwen2_5_VLProcessor" }
[INFO|configuration_utils.py:696] 2025-05-23 13:44:21,178 >> loading configuration file saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again/config.json [INFO|configuration_utils.py:696] 2025-05-23 13:44:21,179 >> loading configuration file saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again/config.json [INFO|configuration_utils.py:770] 2025-05-23 13:44:21,180 >> Model config Qwen2_5_VLConfig { "architectures": [ "Qwen2_5_VLForConditionalGeneration" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 2048, "image_token_id": 151655, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 128000, "max_window_layers": 70, "model_type": "qwen2_5_vl", "num_attention_heads": 16, "num_hidden_layers": 36, "num_key_value_heads": 2, "rms_norm_eps": 1e-06, "rope_scaling": { "mrope_section": [ 16, 24, 24 ], "rope_type": "default", "type": "default" }, "rope_theta": 1000000.0, "sliding_window": 32768, "text_config": { "architectures": [ "Qwen2_5_VLForConditionalGeneration" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 2048, "image_token_id": null, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 128000, "max_window_layers": 70, "model_type": "qwen2_5_vl_text", "num_attention_heads": 16, "num_hidden_layers": 36, "num_key_value_heads": 2, "rms_norm_eps": 1e-06, "rope_scaling": { "mrope_section": [ 16, 24, 24 ], "rope_type": "default", "type": "default" }, "rope_theta": 1000000.0, "sliding_window": 32768, "tie_word_embeddings": true, "torch_dtype": "bfloat16", "use_cache": true, "use_sliding_window": false, "video_token_id": null, "vision_end_token_id": 151653, "vision_start_token_id": 151652, "vision_token_id": 151654, "vocab_size": 151936 }, "torch_dtype": "bfloat16", "transformers_version": "4.52.1", "use_cache": true, "use_sliding_window": false, "video_token_id": 151656, "vision_config": { "depth": 32, "fullatt_block_indexes": [ 7, 15, 23, 31 ], "hidden_act": "silu", "hidden_size": 1280, "in_channels": 3, "in_chans": 3, "initializer_range": 0.02, "intermediate_size": 3420, "model_type": "qwen2_5_vl", "num_heads": 16, "out_hidden_size": 2048, "patch_size": 14, "spatial_merge_size": 2, "spatial_patch_size": 14, "temporal_patch_size": 2, "tokens_per_second": 2, "window_size": 112 }, "vision_end_token_id": 151653, "vision_start_token_id": 151652, "vision_token_id": 151654, "vocab_size": 151936 }
INFO 05-23 13:44:38 [config.py:717] This model supports multiple tasks: {'embed', 'classify', 'generate', 'score', 'reward'}. Defaulting to 'generate'. INFO 05-23 13:44:38 [config.py:1770] Defaulting to use mp for distributed inference INFO 05-23 13:44:38 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=5120. WARNING 05-23 13:44:38 [cuda.py:93] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:38,364 >> loading file vocab.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:38,364 >> loading file merges.txt [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:38,364 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:38,364 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:38,364 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:38,364 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:38,364 >> loading file chat_template.jinja [INFO|tokenization_utils_base.py:2299] 2025-05-23 13:44:38,637 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [INFO|configuration_utils.py:1088] 2025-05-23 13:44:38,692 >> loading configuration file saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again/generation_config.json [INFO|configuration_utils.py:1135] 2025-05-23 13:44:38,693 >> Generate config GenerationConfig { "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "repetition_penalty": 1.05, "temperature": 1e-06 }
WARNING 05-23 13:44:38 [utils.py:2382] We must use the spawn multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing for more information. Reason: CUDA is initialized
INFO 05-23 13:44:51 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-23 13:44:51 [__init__.py:239] Automatically detected platform cuda.
INFO 05-23 13:44:57 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again', speculative_config=None, tokenizer='saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
WARNING 05-23 13:44:57 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 05-23 13:44:57 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_5f7ba305'), local_subscribe_addr='ipc:///tmp/8990c269-2086-4b12-b6ed-427acf2d1b5b', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 05-23 13:45:10 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-23 13:45:10 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-23 13:45:10 [__init__.py:239] Automatically detected platform cuda.
INFO 05-23 13:45:10 [__init__.py:239] Automatically detected platform cuda.
WARNING 05-23 13:45:14 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7208a1a5b070>
WARNING 05-23 13:45:14 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7a3b6ea5c160>
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:14 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_6f812bd4'), local_subscribe_addr='ipc:///tmp/24ee3abb-d3dc-4467-93b2-c52e4ac7bdfd', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:14 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_e9551106'), local_subscribe_addr='ipc:///tmp/066418ee-f44e-4bc5-976d-efb34b980723', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:15 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:15 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:15 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:15 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:15 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/ns94feza/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:15 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/ns94feza/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:15 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_1b30d56c'), local_subscribe_addr='ipc:///tmp/d5c40de6-cde0-472e-b0b6-a60f1d32ba91', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:15 [parallel_state.py:1004] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:15 [parallel_state.py:1004] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:15 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:15 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=1423872) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(VllmWorker rank=1 pid=1423873) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(VllmWorker rank=0 pid=1423872) Unused or unrecognized kwargs: fps, return_tensors.
(VllmWorker rank=1 pid=1423873) Unused or unrecognized kwargs: return_tensors, fps.
(VllmWorker rank=0 pid=1423872) WARNING 05-23 13:45:19 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=1423873) WARNING 05-23 13:45:19 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:19 [gpu_model_runner.py:1329] Starting to load model saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again...
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:19 [gpu_model_runner.py:1329] Starting to load model saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:19 [config.py:3614] cudagraph sizes specified by model runner [] is overridden by config []
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:19 [config.py:3614] cudagraph sizes specified by model runner [] is overridden by config []
(VllmWorker rank=0 pid=1423872)
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] WorkerProc failed to start.
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] Traceback (most recent call last):
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/executor/multiproc_executor.py", line 409, in worker_main
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/executor/multiproc_executor.py", line 306, in __init__
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] self.worker.load_model()
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/worker/gpu_worker.py", line 162, in load_model
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] self.model_runner.load_model()
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1332, in load_model
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] self.model = get_model(vllm_config=self.vllm_config)
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] return loader.load_model(vllm_config=vllm_config)
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] loaded_weights = model.load_weights(
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1126, in load_weights
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] yield from self._load_module(prefix,
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/utils.py", line 195, in _load_module
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] loaded_params = module_load_weights(weights)
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/qwen2.py", line 486, in load_weights
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] return loader.load_weights(weights)
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] yield from self._load_module(prefix,
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/utils.py", line 195, in _load_module
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] loaded_params = module_load_weights(weights)
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/qwen2.py", line 405, in load_weights
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] param = params_dict[name]
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] KeyError: 'language_model.layers.19.input_layernorm.weight'
(VllmWorker rank=0 pid=1423872)
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
(VllmWorker rank=0 pid=1423872)
[rank0]:[W523 13:45:20.315050792 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ERROR 05-23 13:45:21 [core.py:396] EngineCore failed to start.
ERROR 05-23 13:45:21 [core.py:396] Traceback (most recent call last):
ERROR 05-23 13:45:21 [core.py:396] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 05-23 13:45:21 [core.py:396] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 05-23 13:45:21 [core.py:396] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 05-23 13:45:21 [core.py:396] super().__init__(vllm_config, executor_class, log_stats,
ERROR 05-23 13:45:21 [core.py:396] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/engine/core.py", line 64, in __init__
ERROR 05-23 13:45:21 [core.py:396] self.model_executor = executor_class(vllm_config)
ERROR 05-23 13:45:21 [core.py:396] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 05-23 13:45:21 [core.py:396] self._init_executor()
ERROR 05-23 13:45:21 [core.py:396] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/executor/multiproc_executor.py", line 91, in _init_executor
ERROR 05-23 13:45:21 [core.py:396] self.workers = WorkerProc.wait_for_ready(unready_workers)
ERROR 05-23 13:45:21 [core.py:396] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/executor/multiproc_executor.py", line 370, in wait_for_ready
ERROR 05-23 13:45:21 [core.py:396] raise e from None
ERROR 05-23 13:45:21 [core.py:396] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Process EngineCore_0:
Traceback (most recent call last):
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
raise e
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/engine/core.py", line 329, in init
super().init(vllm_config, executor_class, log_stats,
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/engine/core.py", line 64, in init
self.model_executor = executor_class(vllm_config)
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/executor/executor_base.py", line 52, in init
self._init_executor()
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/executor/multiproc_executor.py", line 91, in _init_executor
self.workers = WorkerProc.wait_for_ready(unready_workers)
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/executor/multiproc_executor.py", line 370, in wait_for_ready
raise e from None
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Traceback (most recent call last):
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/weakref.py", line 667, in _exitfunc
f()
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/weakref.py", line 591, in call
return info.func(*info.args, **(info.kwargs or {}))
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/executor/multiproc_executor.py", line 228, in shutdown
for w in self.workers:
AttributeError: 'MultiprocExecutor' object has no attribute 'workers'
Traceback (most recent call last):
File "/home/ns94feza/miniconda3/envs/llama-factory/bin/llamafactory-cli", line 8, in
Others
No response
Same problem while serving a fine-tuned Qwen 2.5 VL 3B model.
(VllmWorker rank=2 pid=21227) INFO 05-26 07:08:03 [config.py:3614] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] WorkerProc failed to start.
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] Traceback (most recent call last):
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 409, in worker_main
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 306, in __init__
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] self.worker.load_model()
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 162, in load_model
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] self.model_runner.load_model()
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1332, in load_model
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] self.model = get_model(vllm_config=self.vllm_config)
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] return loader.load_model(vllm_config=vllm_config)
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] loaded_weights = model.load_weights(
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1126, in load_weights
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] yield from self._load_module(prefix,
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 195, in _load_module
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] loaded_params = module_load_weights(weights)
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 486, in load_weights
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] return loader.load_weights(weights)
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] yield from self._load_module(prefix,
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 195, in _load_module
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] loaded_params = module_load_weights(weights)
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 405, in load_weights
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] param = params_dict[name]
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ~~~~~~~~~~~^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] KeyError: 'language_model.layers.19.input_layernorm.weight'
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
(VllmWorker rank=0 pid=21225)
[rank0]:[W526 07:08:04.170710658 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ERROR 05-26 07:08:06 [core.py:396] EngineCore failed to start.
ERROR 05-26 07:08:06 [core.py:396] Traceback (most recent call last):
ERROR 05-26 07:08:06 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 05-26 07:08:06 [core.py:396] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 05-26 07:08:06 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-26 07:08:06 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 05-26 07:08:06 [core.py:396] super().__init__(vllm_config, executor_class, log_stats,
ERROR 05-26 07:08:06 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 64, in __init__
ERROR 05-26 07:08:06 [core.py:396] self.model_executor = executor_class(vllm_config)
ERROR 05-26 07:08:06 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-26 07:08:06 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 05-26 07:08:06 [core.py:396] self._init_executor()
ERROR 05-26 07:08:06 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 91, in _init_executor
ERROR 05-26 07:08:06 [core.py:396] self.workers = WorkerProc.wait_for_ready(unready_workers)
ERROR 05-26 07:08:06 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-26 07:08:06 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 370, in wait_for_ready
ERROR 05-26 07:08:06 [core.py:396] raise e from None
ERROR 05-26 07:08:06 [core.py:396] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
raise e
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 329, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 64, in __init__
self.model_executor = executor_class(vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
self._init_executor()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 91, in _init_executor
self.workers = WorkerProc.wait_for_ready(unready_workers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 370, in wait_for_ready
raise e from None
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Traceback (most recent call last):
File "/usr/lib/python3.12/weakref.py", line 666, in _exitfunc
f()
File "/usr/lib/python3.12/weakref.py", line 590, in __call__
return info.func(*info.args, **(info.kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 228, in shutdown
for w in self.workers:
^^^^^^^^^^^^
AttributeError: 'MultiprocExecutor' object has no attribute 'workers'
Traceback (most recent call last):
File "/app/scripts/vllm_infer.py", line 199, in <module>
fire.Fire(vllm_infer)
File "/usr/local/lib/python3.12/dist-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/fire/core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/app/scripts/vllm_infer.py", line 112, in vllm_infer
llm = LLM(**engine_args)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 1161, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/llm.py", line 247, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 510, in from_engine_args
return engine_cls.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py", line 112, in from_vllm_config
return cls(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py", line 92, in __init__
self.engine_core = EngineCoreClient.make_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 73, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 494, in __init__
super().__init__(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 398, in __init__
self._wait_for_engine_startup()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Before the 3B model, I had tried serving a fine-tuned 7B model with LoRA and saw no errors.
Currently, there are some bugs in Transformers 4.52.0-4.52.3 when using vLLM to run inference on fine-tuned models. Our patch is released along with Transformers 4.52.4: https://github.com/huggingface/transformers/pull/38385
You can downgrade Transformers to version 4.51.3 or upgrade to Transformers 4.52.4 and train again to avoid this issue.
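If you are unsure whether a given checkpoint is affected, one way to check (a sketch, assuming the checkpoint is saved as safetensors shards) is to list the stored parameter names and see whether they carry the `language_model.`-style prefix that appears in the KeyError above:

```python
# Diagnostic sketch: print the stored parameter names matching the key vLLM failed on,
# to see which naming scheme the checkpoint was saved with.
import glob
from safetensors import safe_open

ckpt_dir = "saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again"  # path from the issue
for shard in sorted(glob.glob(f"{ckpt_dir}/*.safetensors")):
    with safe_open(shard, framework="pt", device="cpu") as f:
        for name in f.keys():
            if "layers.19.input_layernorm" in name:
                print(shard, "->", name)
```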
Has this issue been resolved? Is it now possible to run inference without retraining?
I ran into this problem as well. In my tests, after upgrading the transformers library to 4.52.4, retraining is required; otherwise, loading a model trained with the previous transformers version still raises the error.
I trained a LoRA for Qwen2.5-VL. After upgrading transformers from 4.52.1 to 4.52.4 and re-merging the LoRA, vLLM inference works normally again, so it seems the problem is just that the model's layer key names did not match.
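For the LoRA case, the re-merge can be done with LLaMA-Factory's export command or directly with PEFT; a minimal PEFT sketch is below (the adapter directory is hypothetical, and Transformers should be at 4.52.4 or 4.51.3 when merging so the saved key names are ones vLLM can load):

```python
# Re-merge sketch with PEFT: load the base model, apply the LoRA adapter, merge,
# and save a standalone checkpoint for vLLM. The adapter path is hypothetical.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "saves/qwen2_5vl-3b/lora/sft")  # hypothetical adapter dir
merged = model.merge_and_unload()
merged.save_pretrained("saves/qwen2_5vl-3b-merged")
# Save the processor/tokenizer files alongside the merged weights.
AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct").save_pretrained("saves/qwen2_5vl-3b-merged")
```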
After downgrading to 4.51.3, training works fine for me. However, some models released by other people have this problem, which is harder to deal with.
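For checkpoints that cannot be retrained or re-merged (e.g. models released by others), it may be possible to rewrite the saved key names back to the layout this vLLM version expects. The mapping below is only an assumption, not a confirmed fix: inspect the actual key names first (for example with the listing snippet above) and adjust it accordingly.

```python
# Hypothetical key-renaming sketch: copy the safetensors shards while renaming
# assumed 4.52-style prefixes back to the older layout. Verify the real prefixes
# in your checkpoint before trusting this mapping.
import glob
import os
from safetensors.torch import load_file, save_file

src = "saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again"  # original checkpoint
dst = "saves/qwen2_5vl-3b_llarp_finetuning_renamed"             # rewritten copy
os.makedirs(dst, exist_ok=True)

prefix_map = {  # assumed new-prefix -> old-prefix mapping
    "model.language_model.": "model.",
    "model.visual.": "visual.",
}

def rename(key: str) -> str:
    for new, old in prefix_map.items():
        if key.startswith(new):
            return old + key[len(new):]
    return key

for shard in glob.glob(f"{src}/*.safetensors"):
    tensors = load_file(shard)
    renamed = {rename(k): v for k, v in tensors.items()}
    save_file(renamed, os.path.join(dst, os.path.basename(shard)), metadata={"format": "pt"})

# Note: config.json, tokenizer files, and the model.safetensors.index.json weight_map
# would need to be copied/updated with the same renaming as well.
```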