Error while serving fine-tuned Qwen 2.5 VL model
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
[2025-05-23 13:50:43,655] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 05-23 13:50:48 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-23 13:50:48 [__init__.py:239] Automatically detected platform cuda.
- `llamafactory` version: 0.9.3.dev0
- Platform: Linux-6.8.0-54-generic-x86_64-with-glibc2.39
- Python version: 3.9.21
- PyTorch version: 2.6.0+cu124 (GPU)
- Transformers version: 4.52.1
- Datasets version: 3.6.0
- Accelerate version: 1.7.0
- PEFT version: 0.15.2
- TRL version: 0.9.6
- GPU type: NVIDIA L40S
- GPU number: 2
- GPU memory: 44.40GB
- DeepSpeed version: 0.16.9
- vLLM version: 0.8.5.post1
- Git commit: a9211a730eb3fc7fe0d008107a0a135c3a8734d8
Reproduction
I fine-tuned Qwen 2.5 VL 3B Instruct. Then, I tried to deploy it as follows:
API_PORT=8000 llamafactory-cli api examples/inference/qwen2_5vl.yaml infer_backend=vllm vllm_enforce_eager=true
This gave me an error. I was able to serve the base model using the same command, but not the fine-tuned version.
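For reference, the same load should be reproducible without the LLaMA-Factory CLI by pointing vLLM's Python API at the checkpoint directly (a minimal sketch using the checkpoint path and settings that appear in the logs below; not part of the original run):

```python
# Minimal reproduction sketch: load the fine-tuned checkpoint with vLLM directly.
# Path and settings mirror the llamafactory-cli invocation above; adjust as needed.
from vllm import LLM

llm = LLM(
    model="saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again",
    trust_remote_code=True,
    enforce_eager=True,      # matches vllm_enforce_eager=true
    tensor_parallel_size=2,  # two GPUs, as in the System Info above
)
```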
Here is the error: [2025-05-23 13:44:13,050] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to cuda (auto detect) INFO 05-23 13:44:16 [importing.py:53] Triton module has been replaced with a placeholder. INFO 05-23 13:44:16 [init.py:239] Automatically detected platform cuda. [INFO|configuration_utils.py:696] 2025-05-23 13:44:20,227 >> loading configuration file saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again/config.json [INFO|configuration_utils.py:770] 2025-05-23 13:44:20,236 >> Model config Qwen2_5_VLConfig { "architectures": [ "Qwen2_5_VLForConditionalGeneration" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 2048, "image_token_id": 151655, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 128000, "max_window_layers": 70, "model_type": "qwen2_5_vl", "num_attention_heads": 16, "num_hidden_layers": 36, "num_key_value_heads": 2, "rms_norm_eps": 1e-06, "rope_scaling": { "mrope_section": [ 16, 24, 24 ], "rope_type": "default", "type": "default" }, "rope_theta": 1000000.0, "sliding_window": 32768, "text_config": { "architectures": [ "Qwen2_5_VLForConditionalGeneration" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 2048, "image_token_id": null, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 128000, "max_window_layers": 70, "model_type": "qwen2_5_vl_text", "num_attention_heads": 16, "num_hidden_layers": 36, "num_key_value_heads": 2, "rms_norm_eps": 1e-06, "rope_scaling": { "mrope_section": [ 16, 24, 24 ], "rope_type": "default", "type": "default" }, "rope_theta": 1000000.0, "sliding_window": 32768, "tie_word_embeddings": true, "torch_dtype": "bfloat16", "use_cache": true, "use_sliding_window": false, "video_token_id": null, "vision_end_token_id": 151653, "vision_start_token_id": 151652, "vision_token_id": 151654, "vocab_size": 151936 }, "torch_dtype": "bfloat16", "transformers_version": "4.52.1", "use_cache": true, "use_sliding_window": false, "video_token_id": 151656, "vision_config": { "depth": 32, "fullatt_block_indexes": [ 7, 15, 23, 31 ], "hidden_act": "silu", "hidden_size": 1280, "in_channels": 3, "in_chans": 3, "initializer_range": 0.02, "intermediate_size": 3420, "model_type": "qwen2_5_vl", "num_heads": 16, "out_hidden_size": 2048, "patch_size": 14, "spatial_merge_size": 2, "spatial_patch_size": 14, "temporal_patch_size": 2, "tokens_per_second": 2, "window_size": 112 }, "vision_end_token_id": 151653, "vision_start_token_id": 151652, "vision_token_id": 151654, "vocab_size": 151936 }
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,250 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,250 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,250 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,250 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,250 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,250 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,250 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2299] 2025-05-23 13:44:20,517 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|image_processing_base.py:378] 2025-05-23 13:44:20,518 >> loading configuration file saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again/preprocessor_config.json
[INFO|image_processing_base.py:378] 2025-05-23 13:44:20,521 >> loading configuration file saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again/preprocessor_config.json
[WARNING|logging.py:328] 2025-05-23 13:44:20,521 >> Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
[INFO|image_processing_base.py:433] 2025-05-23 13:44:20,530 >> Image processor Qwen2VLImageProcessor {
"do_convert_rgb": true,
"do_normalize": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.48145466,
0.4578275,
0.40821073
],
"image_processor_type": "Qwen2VLImageProcessor",
"image_std": [
0.26862954,
0.26130258,
0.27577711
],
"max_pixels": 12845056,
"merge_size": 2,
"min_pixels": 3136,
"patch_size": 14,
"processor_class": "Qwen2_5_VLProcessor",
"resample": 3,
"rescale_factor": 0.00392156862745098,
"size": {
"longest_edge": 12845056,
"shortest_edge": 3136
},
"temporal_patch_size": 2
}
[INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,531 >> loading file vocab.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,531 >> loading file merges.txt [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,531 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,531 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,531 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,531 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:20,531 >> loading file chat_template.jinja [INFO|tokenization_utils_base.py:2299] 2025-05-23 13:44:20,788 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [INFO|video_processing_utils.py:627] 2025-05-23 13:44:20,790 >> loading configuration file saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again/video_preprocessor_config.json [INFO|video_processing_utils.py:683] 2025-05-23 13:44:20,797 >> Video processor Qwen2VLVideoProcessor { "_valid_kwargs_names": [ "do_convert_rgb", "do_resize", "size", "size_divisor", "default_to_square", "resample", "do_rescale", "rescale_factor", "do_normalize", "image_mean", "image_std", "do_pad", "do_center_crop", "crop_size", "data_format", "input_data_format", "device", "min_pixels", "max_pixels", "patch_size", "temporal_patch_size", "merge_size" ], "crop_size": null, "data_format": "channels_first", "default_to_square": true, "device": null, "do_center_crop": null, "do_convert_rgb": true, "do_normalize": true, "do_pad": null, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "input_data_format": null, "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "model_valid_processing_keys": [ "do_convert_rgb", "do_resize", "size", "size_divisor", "default_to_square", "resample", "do_rescale", "rescale_factor", "do_normalize", "image_mean", "image_std", "do_pad", "do_center_crop", "crop_size", "data_format", "input_data_format", "device", "min_pixels", "max_pixels", "patch_size", "temporal_patch_size", "merge_size" ], "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "size_divisor": null, "temporal_patch_size": 2, "video_processor_type": "Qwen2VLVideoProcessor" }
[INFO|processing_utils.py:990] 2025-05-23 13:44:21,091 >> Processor Qwen2_5_VLProcessor:
image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 }
tokenizer: Qwen2TokenizerFast(name_or_path='saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("<tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("</tool_call>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } )
video_processor: Qwen2VLVideoProcessor { "_valid_kwargs_names": [ "do_convert_rgb", "do_resize", "size", "size_divisor", "default_to_square", "resample", "do_rescale", "rescale_factor", "do_normalize", "image_mean", "image_std", "do_pad", "do_center_crop", "crop_size", "data_format", "input_data_format", "device", "min_pixels", "max_pixels", "patch_size", "temporal_patch_size", "merge_size" ], "crop_size": null, "data_format": "channels_first", "default_to_square": true, "device": null, "do_center_crop": null, "do_convert_rgb": true, "do_normalize": true, "do_pad": null, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "input_data_format": null, "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "model_valid_processing_keys": [ "do_convert_rgb", "do_resize", "size", "size_divisor", "default_to_square", "resample", "do_rescale", "rescale_factor", "do_normalize", "image_mean", "image_std", "do_pad", "do_center_crop", "crop_size", "data_format", "input_data_format", "device", "min_pixels", "max_pixels", "patch_size", "temporal_patch_size", "merge_size" ], "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "size_divisor": null, "temporal_patch_size": 2, "video_processor_type": "Qwen2VLVideoProcessor" }
{ "processor_class": "Qwen2_5_VLProcessor" }
[INFO|configuration_utils.py:696] 2025-05-23 13:44:21,178 >> loading configuration file saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again/config.json [INFO|configuration_utils.py:696] 2025-05-23 13:44:21,179 >> loading configuration file saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again/config.json [INFO|configuration_utils.py:770] 2025-05-23 13:44:21,180 >> Model config Qwen2_5_VLConfig { "architectures": [ "Qwen2_5_VLForConditionalGeneration" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 2048, "image_token_id": 151655, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 128000, "max_window_layers": 70, "model_type": "qwen2_5_vl", "num_attention_heads": 16, "num_hidden_layers": 36, "num_key_value_heads": 2, "rms_norm_eps": 1e-06, "rope_scaling": { "mrope_section": [ 16, 24, 24 ], "rope_type": "default", "type": "default" }, "rope_theta": 1000000.0, "sliding_window": 32768, "text_config": { "architectures": [ "Qwen2_5_VLForConditionalGeneration" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 2048, "image_token_id": null, "initializer_range": 0.02, "intermediate_size": 11008, "max_position_embeddings": 128000, "max_window_layers": 70, "model_type": "qwen2_5_vl_text", "num_attention_heads": 16, "num_hidden_layers": 36, "num_key_value_heads": 2, "rms_norm_eps": 1e-06, "rope_scaling": { "mrope_section": [ 16, 24, 24 ], "rope_type": "default", "type": "default" }, "rope_theta": 1000000.0, "sliding_window": 32768, "tie_word_embeddings": true, "torch_dtype": "bfloat16", "use_cache": true, "use_sliding_window": false, "video_token_id": null, "vision_end_token_id": 151653, "vision_start_token_id": 151652, "vision_token_id": 151654, "vocab_size": 151936 }, "torch_dtype": "bfloat16", "transformers_version": "4.52.1", "use_cache": true, "use_sliding_window": false, "video_token_id": 151656, "vision_config": { "depth": 32, "fullatt_block_indexes": [ 7, 15, 23, 31 ], "hidden_act": "silu", "hidden_size": 1280, "in_channels": 3, "in_chans": 3, "initializer_range": 0.02, "intermediate_size": 3420, "model_type": "qwen2_5_vl", "num_heads": 16, "out_hidden_size": 2048, "patch_size": 14, "spatial_merge_size": 2, "spatial_patch_size": 14, "temporal_patch_size": 2, "tokens_per_second": 2, "window_size": 112 }, "vision_end_token_id": 151653, "vision_start_token_id": 151652, "vision_token_id": 151654, "vocab_size": 151936 }
INFO 05-23 13:44:38 [config.py:717] This model supports multiple tasks: {'embed', 'classify', 'generate', 'score', 'reward'}. Defaulting to 'generate'. INFO 05-23 13:44:38 [config.py:1770] Defaulting to use mp for distributed inference INFO 05-23 13:44:38 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=5120. WARNING 05-23 13:44:38 [cuda.py:93] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:38,364 >> loading file vocab.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:38,364 >> loading file merges.txt [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:38,364 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:38,364 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:38,364 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:38,364 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2021] 2025-05-23 13:44:38,364 >> loading file chat_template.jinja [INFO|tokenization_utils_base.py:2299] 2025-05-23 13:44:38,637 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [INFO|configuration_utils.py:1088] 2025-05-23 13:44:38,692 >> loading configuration file saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again/generation_config.json [INFO|configuration_utils.py:1135] 2025-05-23 13:44:38,693 >> Generate config GenerationConfig { "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "repetition_penalty": 1.05, "temperature": 1e-06 }
WARNING 05-23 13:44:38 [utils.py:2382] We must use the spawn multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing for more information. Reason: CUDA is initialized
INFO 05-23 13:44:51 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-23 13:44:51 [__init__.py:239] Automatically detected platform cuda.
INFO 05-23 13:44:57 [core.py:58] Initializing a V1 LLM engine (v0.8.5.post1) with config: model='saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again', speculative_config=None, tokenizer='saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
WARNING 05-23 13:44:57 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 05-23 13:44:57 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_5f7ba305'), local_subscribe_addr='ipc:///tmp/8990c269-2086-4b12-b6ed-427acf2d1b5b', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 05-23 13:45:10 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-23 13:45:10 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-23 13:45:10 [__init__.py:239] Automatically detected platform cuda.
INFO 05-23 13:45:10 [__init__.py:239] Automatically detected platform cuda.
WARNING 05-23 13:45:14 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7208a1a5b070>
WARNING 05-23 13:45:14 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7a3b6ea5c160>
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:14 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_6f812bd4'), local_subscribe_addr='ipc:///tmp/24ee3abb-d3dc-4467-93b2-c52e4ac7bdfd', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:14 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_e9551106'), local_subscribe_addr='ipc:///tmp/066418ee-f44e-4bc5-976d-efb34b980723', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:15 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:15 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:15 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:15 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:15 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/ns94feza/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:15 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/ns94feza/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:15 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_1b30d56c'), local_subscribe_addr='ipc:///tmp/d5c40de6-cde0-472e-b0b6-a60f1d32ba91', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:15 [parallel_state.py:1004] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:15 [parallel_state.py:1004] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:15 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:15 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=1423872) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(VllmWorker rank=1 pid=1423873) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(VllmWorker rank=0 pid=1423872) Unused or unrecognized kwargs: fps, return_tensors.
(VllmWorker rank=1 pid=1423873) Unused or unrecognized kwargs: return_tensors, fps.
(VllmWorker rank=0 pid=1423872) WARNING 05-23 13:45:19 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=1423873) WARNING 05-23 13:45:19 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:19 [gpu_model_runner.py:1329] Starting to load model saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again...
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:19 [gpu_model_runner.py:1329] Starting to load model saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
(VllmWorker rank=1 pid=1423873) INFO 05-23 13:45:19 [config.py:3614] cudagraph sizes specified by model runner [] is overridden by config []
(VllmWorker rank=0 pid=1423872) INFO 05-23 13:45:19 [config.py:3614] cudagraph sizes specified by model runner [] is overridden by config []
(VllmWorker rank=0 pid=1423872)
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] WorkerProc failed to start.
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] Traceback (most recent call last):
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/executor/multiproc_executor.py", line 409, in worker_main
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/executor/multiproc_executor.py", line 306, in __init__
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] self.worker.load_model()
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/worker/gpu_worker.py", line 162, in load_model
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] self.model_runner.load_model()
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1332, in load_model
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] self.model = get_model(vllm_config=self.vllm_config)
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] return loader.load_model(vllm_config=vllm_config)
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] loaded_weights = model.load_weights(
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1126, in load_weights
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] yield from self._load_module(prefix,
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/utils.py", line 195, in _load_module
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] loaded_params = module_load_weights(weights)
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/qwen2.py", line 486, in load_weights
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] return loader.load_weights(weights)
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] yield from self._load_module(prefix,
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/utils.py", line 195, in _load_module
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] loaded_params = module_load_weights(weights)
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/model_executor/models/qwen2.py", line 405, in load_weights
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] param = params_dict[name]
(VllmWorker rank=1 pid=1423873) ERROR 05-23 13:45:19 [multiproc_executor.py:435] KeyError: 'language_model.layers.19.input_layernorm.weight'
(VllmWorker rank=0 pid=1423872)
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
(VllmWorker rank=0 pid=1423872)
[rank0]:[W523 13:45:20.315050792 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ERROR 05-23 13:45:21 [core.py:396] EngineCore failed to start.
ERROR 05-23 13:45:21 [core.py:396] Traceback (most recent call last):
ERROR 05-23 13:45:21 [core.py:396] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 05-23 13:45:21 [core.py:396] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 05-23 13:45:21 [core.py:396] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 05-23 13:45:21 [core.py:396] super().__init__(vllm_config, executor_class, log_stats,
ERROR 05-23 13:45:21 [core.py:396] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/engine/core.py", line 64, in __init__
ERROR 05-23 13:45:21 [core.py:396] self.model_executor = executor_class(vllm_config)
ERROR 05-23 13:45:21 [core.py:396] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 05-23 13:45:21 [core.py:396] self._init_executor()
ERROR 05-23 13:45:21 [core.py:396] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/executor/multiproc_executor.py", line 91, in _init_executor
ERROR 05-23 13:45:21 [core.py:396] self.workers = WorkerProc.wait_for_ready(unready_workers)
ERROR 05-23 13:45:21 [core.py:396] File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/executor/multiproc_executor.py", line 370, in wait_for_ready
ERROR 05-23 13:45:21 [core.py:396] raise e from None
ERROR 05-23 13:45:21 [core.py:396] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Process EngineCore_0:
Traceback (most recent call last):
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
raise e
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/engine/core.py", line 329, in init
super().init(vllm_config, executor_class, log_stats,
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/engine/core.py", line 64, in init
self.model_executor = executor_class(vllm_config)
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/executor/executor_base.py", line 52, in init
self._init_executor()
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/executor/multiproc_executor.py", line 91, in _init_executor
self.workers = WorkerProc.wait_for_ready(unready_workers)
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/executor/multiproc_executor.py", line 370, in wait_for_ready
raise e from None
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Traceback (most recent call last):
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/weakref.py", line 667, in _exitfunc
f()
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/weakref.py", line 591, in call
return info.func(*info.args, **(info.kwargs or {}))
File "/home/ns94feza/miniconda3/envs/llama-factory/lib/python3.9/site-packages/vllm/v1/executor/multiproc_executor.py", line 228, in shutdown
for w in self.workers:
AttributeError: 'MultiprocExecutor' object has no attribute 'workers'
Traceback (most recent call last):
File "/home/ns94feza/miniconda3/envs/llama-factory/bin/llamafactory-cli", line 8, in
Others
No response
Same problem while serving a fine-tuned Qwen 2.5 VL 3B model.
(VllmWorker rank=2 pid=21227) INFO 05-26 07:08:03 [config.py:3614] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] WorkerProc failed to start.
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] Traceback (most recent call last):
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 409, in worker_main
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 306, in __init__
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] self.worker.load_model()
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 162, in load_model
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] self.model_runner.load_model()
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1332, in load_model
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] self.model = get_model(vllm_config=self.vllm_config)
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] return loader.load_model(vllm_config=vllm_config)
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 455, in load_model
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] loaded_weights = model.load_weights(
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1126, in load_weights
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] yield from self._load_module(prefix,
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 195, in _load_module
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] loaded_params = module_load_weights(weights)
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 486, in load_weights
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] return loader.load_weights(weights)
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 261, in load_weights
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] yield from self._load_module(prefix,
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 195, in _load_module
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] loaded_params = module_load_weights(weights)
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 405, in load_weights
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] param = params_dict[name]
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] ~~~~~~~~~~~^^^^^^
(VllmWorker rank=3 pid=21228) ERROR 05-26 07:08:03 [multiproc_executor.py:435] KeyError: 'language_model.layers.19.input_layernorm.weight'
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
(VllmWorker rank=0 pid=21225)
[rank0]:[W526 07:08:04.170710658 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ERROR 05-26 07:08:06 [core.py:396] EngineCore failed to start.
ERROR 05-26 07:08:06 [core.py:396] Traceback (most recent call last):
ERROR 05-26 07:08:06 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 05-26 07:08:06 [core.py:396] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 05-26 07:08:06 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-26 07:08:06 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 05-26 07:08:06 [core.py:396] super().__init__(vllm_config, executor_class, log_stats,
ERROR 05-26 07:08:06 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 64, in __init__
ERROR 05-26 07:08:06 [core.py:396] self.model_executor = executor_class(vllm_config)
ERROR 05-26 07:08:06 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-26 07:08:06 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 05-26 07:08:06 [core.py:396] self._init_executor()
ERROR 05-26 07:08:06 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 91, in _init_executor
ERROR 05-26 07:08:06 [core.py:396] self.workers = WorkerProc.wait_for_ready(unready_workers)
ERROR 05-26 07:08:06 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-26 07:08:06 [core.py:396] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 370, in wait_for_ready
ERROR 05-26 07:08:06 [core.py:396] raise e from None
ERROR 05-26 07:08:06 [core.py:396] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
raise e
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 329, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 64, in __init__
self.model_executor = executor_class(vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
self._init_executor()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 91, in _init_executor
self.workers = WorkerProc.wait_for_ready(unready_workers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 370, in wait_for_ready
raise e from None
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Traceback (most recent call last):
File "/usr/lib/python3.12/weakref.py", line 666, in _exitfunc
f()
File "/usr/lib/python3.12/weakref.py", line 590, in __call__
return info.func(*info.args, **(info.kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 228, in shutdown
for w in self.workers:
^^^^^^^^^^^^
AttributeError: 'MultiprocExecutor' object has no attribute 'workers'
Traceback (most recent call last):
File "/app/scripts/vllm_infer.py", line 199, in <module>
fire.Fire(vllm_infer)
File "/usr/local/lib/python3.12/dist-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/fire/core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/app/scripts/vllm_infer.py", line 112, in vllm_infer
llm = LLM(**engine_args)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 1161, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/llm.py", line 247, in __init__
self.llm_engine = LLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 510, in from_engine_args
return engine_cls.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py", line 112, in from_vllm_config
return cls(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py", line 92, in __init__
self.engine_core = EngineCoreClient.make_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 73, in make_client
return SyncMPClient(vllm_config, executor_class, log_stats)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 494, in __init__
super().__init__(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 398, in __init__
self._wait_for_engine_startup()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Before the 3B model, I had tried serving a fine-tuned 7B model with LoRA and saw no errors.
Currently, there are some bugs in Transformers 4.52.0-4.52.3 when using vLLM to run inference on fine-tuned models. Our patch is released along with Transformers 4.52.4: https://github.com/huggingface/transformers/pull/38385
You can downgrade Transformers to version 4.51.3 or upgrade to Transformers 4.52.4 and train again to avoid this issue.
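If you are unsure whether a given checkpoint is affected, one way to check (a sketch, assuming the checkpoint is saved as safetensors shards) is to list the stored parameter names and see whether they carry the `language_model.`-style prefix that appears in the KeyError above:

```python
# Diagnostic sketch: print the stored parameter names matching the key vLLM failed on,
# to see which naming scheme the checkpoint was saved with.
import glob
from safetensors import safe_open

ckpt_dir = "saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again"  # path from the issue
for shard in sorted(glob.glob(f"{ckpt_dir}/*.safetensors")):
    with safe_open(shard, framework="pt", device="cpu") as f:
        for name in f.keys():
            if "layers.19.input_layernorm" in name:
                print(shard, "->", name)
```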
Has this issue been resolved? Is it now possible to run inference without retraining?
I ran into this problem as well. In my tests, after upgrading the transformers library to 4.52.4, retraining is required; otherwise, loading a model trained with the previous transformers version still raises the error.
I trained a LoRA for Qwen2.5-VL. After upgrading transformers from 4.52.1 to 4.52.4 and re-merging the LoRA, vLLM inference works normally again, so it seems the problem is just that the model's layer key names did not match.
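For the LoRA case, the re-merge can be done with LLaMA-Factory's export command or directly with PEFT; a minimal PEFT sketch is below (the adapter directory is hypothetical, and Transformers should be at 4.52.4 or 4.51.3 when merging so the saved key names are ones vLLM can load):

```python
# Re-merge sketch with PEFT: load the base model, apply the LoRA adapter, merge,
# and save a standalone checkpoint for vLLM. The adapter path is hypothetical.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "saves/qwen2_5vl-3b/lora/sft")  # hypothetical adapter dir
merged = model.merge_and_unload()
merged.save_pretrained("saves/qwen2_5vl-3b-merged")
# Save the processor/tokenizer files alongside the merged weights.
AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct").save_pretrained("saves/qwen2_5vl-3b-merged")
```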
After downgrading to 4.51.3, training works fine for me. However, some models released by other people have this problem, which is harder to deal with.
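For checkpoints that cannot be retrained or re-merged (e.g. models released by others), it may be possible to rewrite the saved key names back to the layout this vLLM version expects. The mapping below is only an assumption, not a confirmed fix: inspect the actual key names first (for example with the listing snippet above) and adjust it accordingly.

```python
# Hypothetical key-renaming sketch: copy the safetensors shards while renaming
# assumed 4.52-style prefixes back to the older layout. Verify the real prefixes
# in your checkpoint before trusting this mapping.
import glob
import os
from safetensors.torch import load_file, save_file

src = "saves/qwen2_5vl-3b_llarp_finetuning/full/sft/try_again"  # original checkpoint
dst = "saves/qwen2_5vl-3b_llarp_finetuning_renamed"             # rewritten copy
os.makedirs(dst, exist_ok=True)

prefix_map = {  # assumed new-prefix -> old-prefix mapping
    "model.language_model.": "model.",
    "model.visual.": "visual.",
}

def rename(key: str) -> str:
    for new, old in prefix_map.items():
        if key.startswith(new):
            return old + key[len(new):]
    return key

for shard in glob.glob(f"{src}/*.safetensors"):
    tensors = load_file(shard)
    renamed = {rename(k): v for k, v in tensors.items()}
    save_file(renamed, os.path.join(dst, os.path.basename(shard)), metadata={"format": "pt"})

# Note: config.json, tokenizer files, and the model.safetensors.index.json weight_map
# would need to be copied/updated with the same renaming as well.
```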