[BUG] LoRA training error with openbmb/MiniCPM-V-4_5
Is there an existing issue / discussion for this?
- [x] I have searched the existing issues / discussions
Is there an existing answer for this in the FAQ?
- [x] I have searched the FAQ
Current Behavior
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/workspace/qwen/light_vl_model/MiniCPM-o-main/finetune/finetune.py", line 303, in
[rank0]: train()
[rank0]: File "/mnt/workspace/qwen/light_vl_model/MiniCPM-o-main/finetune/finetune.py", line 293, in train
[rank0]: trainer.train()
[rank0]: File "/usr/lib/python3.12/site-packages/transformers/trainer.py", line 2207, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/site-packages/transformers/trainer.py", line 2549, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/qwen/light_vl_model/MiniCPM-o-main/finetune/trainer.py", line 199, in training_step
[rank0]: loss = self.compute_loss(model, inputs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/workspace/qwen/light_vl_model/MiniCPM-o-main/finetune/trainer.py", line 23, in compute_loss
[rank0]: outputs = self.model.base_model(data = inputs, use_cache=False)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/site-packages/peft/tuners/tuners_utils.py", line 222, in forward
[rank0]: return self.model.forward(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/minicpm_v_4_5/modeling_minicpmv.py", line 204, in forward
[rank0]: vllm_embedding, vision_hidden_states = self.get_vllm_embedding(data)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/minicpm_v_4_5/modeling_minicpmv.py", line 127, in get_vllm_embedding
[rank0]: vision_embedding = self.resampler(vision_embedding, tgt_sizes, all_temporal_ids)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/site-packages/peft/utils/other.py", line 412, in forward
[rank0]: return self._forward_wrapped(x, *args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/site-packages/peft/utils/other.py", line 484, in _forward_wrapped
[rank0]: return self.modules_to_save[self.active_adapters[0]](x, *args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/minicpm_v_4_5/resampler.py", line 223, in forward
[rank0]: out = self.batch_attn_forward(q, k, v, pos_embed_temporal, temporal_ids, key_padding_mask)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/minicpm_v_4_5/resampler.py", line 265, in batch_attn_forward
[rank0]: out = self.attn(
[rank0]: ^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/site-packages/torch/nn/modules/activation.py", line 1373, in forward
[rank0]: attn_output, attn_output_weights = F.multi_head_attention_forward(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/lib/python3.12/site-packages/torch/nn/functional.py", line 6298, in multi_head_attention_forward
[rank0]: k = k.view(k.shape[0], bsz * num_heads, head_dim).transpose(0, 1)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: shape '[1034, 33088, 128]' is invalid for input of size 38117376
0%| | 0/10000 [00:01<?, ?it/s]
[rank0]:[W827 11:09:19.984526291 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0827 11:09:20.400000 989550 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 989573) of binary: /usr/bin/python3.12
Traceback (most recent call last):
File "/usr/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/usr/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/usr/lib/python3.12/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/usr/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetune.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2025-08-27_11:09:20
  host      : dsw-352794-6d7f6547dd-qvbvz
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 989573)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Expected Behavior
No response
Steps To Reproduce
No response
Environment
- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):
Anything else?
No response
Hi, quick triage on your trace:
The crash is surfaced by `torch.distributed.elastic` as a `ChildFailedError`, but the first failure happens earlier, in `get_vllm_embedding` → `forward`. This pattern usually comes from schema/version drift between components rather than a single-line bug. In our map it matches ProblemMap No.2 (model–tokenizer or processor mismatch) plus No.14 (bootstrap ordering of the vision/audio towers and LoRA attach order).
Fast checks to confirm:
- Pin exact pairs: load the model, tokenizer, and image processor from the same commit; print and compare vocab size, pad id, vision dims, patch size.
- LoRA attach order: attach after the base model is built and the processors are bound; verify the `target_modules` actually exist.
- Reduce variables: run on 1 GPU with `CUDA_VISIBLE_DEVICES=0`, `num_workers=0`, `pin_memory=False`, a tiny dataset, `epochs=1`.
- Dtype/device: ensure bf16 is supported, or switch to fp16; check that all inputs land on the same device before the forward pass.
- Versions: freeze `transformers`, `accelerate`, `bitsandbytes`, `flash-attn`/`xformers` to the recipe's known-good set.
- Batch sanity: print the shapes of one batch just before the model call; catch `None` or empty tensors coming out of collate (see the sketch below).
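For the batch-sanity check, a minimal sketch (assuming `train_dataset` and `data_collator` are the objects your `finetune.py` already builds; nothing here is specific to MiniCPM-V):

```python
# Pull one collated batch and dump every field before it reaches the model.
# `train_dataset` and `data_collator` are assumed to be whatever your
# finetune.py constructs; the printed keys are whatever your collator returns.
import torch
from torch.utils.data import DataLoader

loader = DataLoader(train_dataset, batch_size=1, shuffle=False,
                    num_workers=0, collate_fn=data_collator)
batch = next(iter(loader))

for key, value in batch.items():
    if torch.is_tensor(value):
        print(f"{key:20s} tensor {tuple(value.shape)} {value.dtype}")
    elif value is None:
        print(f"{key:20s} None  <- make sure the model really expects None here")
    elif isinstance(value, (list, tuple)):
        print(f"{key:20s} {type(value).__name__} of len {len(value)}")
    else:
        print(f"{key:20s} {type(value).__name__}")
```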
If you want, I can share the short checklist we use for No.2 + No.14 so you can tick through it in a few minutes. Just say the word and I'll drop it.
I'm hitting the same error. Has it been solved?
Solved. The cause is that the `temporal_ids` produced by the image_processor are never passed in. You need to generate them manually in the dataset, or take the image_processor's output directly and put it into the collator. For the generation logic, see https://github.com/hiyouga/LLaMA-Factory/blob/59f2bf1ea369ca91774b99e8d94a578657be6c7c/src/llamafactory/data/mm_plugin.py#L951
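To make that concrete, a rough sketch of the plumbing (everything except the `temporal_ids` key itself is an assumption about your preprocessing and collator code, not the exact MiniCPM-V-4_5 API; the linked mm_plugin.py shows the real generation logic):

```python
# Keep the temporal_ids that the image processor produces and forward them
# through the collator, so that get_vllm_embedding() receives them and the
# resampler no longer sees a mismatched shape. The processor call and every
# key name except "temporal_ids" are assumptions about your own code.

def preprocess_sample(sample, image_processor):
    # Hypothetical preprocessing step: however your dataset already obtains
    # pixel_values / tgt_sizes, keep temporal_ids from the same processor
    # output instead of dropping it.
    proc_out = image_processor(sample["images"], return_tensors="pt")
    sample["pixel_values"] = proc_out["pixel_values"]
    sample["tgt_sizes"] = proc_out["tgt_sizes"]
    sample["temporal_ids"] = proc_out.get("temporal_ids")  # often None for still images
    return sample

def collate_with_temporal_ids(samples, base_collator):
    # base_collator is a stand-in for the collate function you already use;
    # the only change is carrying the extra field into the batch dict.
    batch = base_collator(samples)
    batch["temporal_ids"] = [s.get("temporal_ids") for s in samples]
    return batch
```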
> Solved. The cause is that the `temporal_ids` produced by the image_processor are never passed in. You need to generate them manually in the dataset, or take the image_processor's output directly and put it into the collator. For the generation logic, see https://github.com/hiyouga/LLaMA-Factory/blob/59f2bf1ea369ca91774b99e8d94a578657be6c7c/src/llamafactory/data/mm_plugin.py#L951

Could you share the concrete code for this fix? @zyandtom
Just add a `temporal_ids` field in the collator and set it to None. If you actually need to pass real values, refer to the official demo at https://huggingface.co/openbmb/MiniCPM-V-4_5.
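A minimal sketch of that workaround, assuming a collator that returns the dict your trainer passes as `data` to the model (whether a single None or one None per sample is expected depends on your dataset code; the per-sample variant is shown):

```python
# Workaround sketch: keep the existing collator and add a temporal_ids field
# filled with None, so no video temporal structure is assumed for the inputs.
# `base_collator` is a stand-in for the collate function you already use.

def collate_with_null_temporal_ids(samples, base_collator):
    batch = base_collator(samples)
    # One None per sample: "no video temporal ids for this input".
    batch["temporal_ids"] = [None] * len(samples)
    return batch
```

For image-only LoRA runs this should be enough; for video data you need the real ids, as described in the comment above.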