
[BUG] Training openbmb/MiniCPM-V-4_5 with LoRA fails

Open XYZ-916 opened this issue 4 months ago • 5 comments

Is there an existing issue / discussion for this?

  • [x] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • [x] I have searched FAQ

Current Behavior

[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/workspace/qwen/light_vl_model/MiniCPM-o-main/finetune/finetune.py", line 303, in <module>
[rank0]:     train()
[rank0]:   File "/mnt/workspace/qwen/light_vl_model/MiniCPM-o-main/finetune/finetune.py", line 293, in train
[rank0]:     trainer.train()
[rank0]:   File "/usr/lib/python3.12/site-packages/transformers/trainer.py", line 2207, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/lib/python3.12/site-packages/transformers/trainer.py", line 2549, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/mnt/workspace/qwen/light_vl_model/MiniCPM-o-main/finetune/trainer.py", line 199, in training_step
[rank0]:     loss = self.compute_loss(model, inputs)
[rank0]:   File "/mnt/workspace/qwen/light_vl_model/MiniCPM-o-main/finetune/trainer.py", line 23, in compute_loss
[rank0]:     outputs = self.model.base_model(data = inputs, use_cache=False)
[rank0]:   File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/lib/python3.12/site-packages/peft/tuners/tuners_utils.py", line 222, in forward
[rank0]:     return self.model.forward(*args, **kwargs)
[rank0]:   File "/root/.cache/huggingface/modules/transformers_modules/minicpm_v_4_5/modeling_minicpmv.py", line 204, in forward
[rank0]:     vllm_embedding, vision_hidden_states = self.get_vllm_embedding(data)
[rank0]:   File "/root/.cache/huggingface/modules/transformers_modules/minicpm_v_4_5/modeling_minicpmv.py", line 127, in get_vllm_embedding
[rank0]:     vision_embedding = self.resampler(vision_embedding, tgt_sizes, all_temporal_ids)
[rank0]:   File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/lib/python3.12/site-packages/peft/utils/other.py", line 412, in forward
[rank0]:     return self._forward_wrapped(x, *args, **kwargs)
[rank0]:   File "/usr/lib/python3.12/site-packages/peft/utils/other.py", line 484, in _forward_wrapped
[rank0]:     return self.modules_to_save[self.active_adapters[0]](x, *args, **kwargs)
[rank0]:   File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/.cache/huggingface/modules/transformers_modules/minicpm_v_4_5/resampler.py", line 223, in forward
[rank0]:     out = self.batch_attn_forward(q, k, v, pos_embed_temporal, temporal_ids, key_padding_mask)
[rank0]:   File "/root/.cache/huggingface/modules/transformers_modules/minicpm_v_4_5/resampler.py", line 265, in batch_attn_forward
[rank0]:     out = self.attn(
[rank0]:   File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/lib/python3.12/site-packages/torch/nn/modules/activation.py", line 1373, in forward
[rank0]:     attn_output, attn_output_weights = F.multi_head_attention_forward(
[rank0]:   File "/usr/lib/python3.12/site-packages/torch/nn/functional.py", line 6298, in multi_head_attention_forward
[rank0]:     k = k.view(k.shape[0], bsz * num_heads, head_dim).transpose(0, 1)
[rank0]: RuntimeError: shape '[1034, 33088, 128]' is invalid for input of size 38117376

  0%|          | 0/10000 [00:01<?, ?it/s]
[rank0]:[W827 11:09:19.984526291 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0827 11:09:20.400000 989550 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 989573) of binary: /usr/bin/python3.12
Traceback (most recent call last):
  File "/usr/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3.12/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/usr/lib/python3.12/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/usr/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2025-08-27_11:09:20
  host       : dsw-352794-6d7f6547dd-qvbvz
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 989573)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

XYZ-916 avatar Aug 27 '25 03:08 XYZ-916

hi, quick triage on your trace:

the crash is surfaced as torch.distributed.elastic's ChildFailedError, but the first failure happens earlier, in get_vllm_embedding → the resampler forward. this pattern usually comes from schema/version drift between components rather than a single-line bug. in our map it matches ProblemMap No.2 (model–tokenizer or processor mismatch) plus No.14 (bootstrap ordering on vision/audio towers and LoRA attach order).

fast checks to confirm:

  1. Pin exact pairs: model id, tokenizer, and image processor from the same commit. print and compare vocab size, pad id, vision dims, patch size.
  2. LoRA attach order: attach after the base is built and processors are bound. verify target_modules actually exist.
  3. Reduce variables: run 1 GPU, CUDA_VISIBLE_DEVICES=0, num_workers=0, pin_memory=False, tiny dataset, epochs=1.
  4. Dtype/device: ensure bf16 is supported, or switch to fp16; check that all inputs land on the same device before forward.
  5. Versions: freeze transformers, accelerate, bitsandbytes, flash-attn/xformers to the recipe known-good set.
  6. Batch sanity: print one batch's shapes just before the model call; catch None or empty tensors coming out of the collator (see the sketch right after this list).
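here's a minimal sketch of check 6 — names like inspect_batch are just illustrative, and it assumes you call it from compute_loss in finetune/trainer.py with the collated inputs dict:

# sketch only: dump every entry of a collated batch before it reaches model.forward()
import torch

def inspect_batch(batch):
    """Print shape/dtype/device for each tensor in the batch and flag None or empty entries."""
    for key, value in batch.items():
        if value is None:
            print(f"{key:20s} -> None")
        elif torch.is_tensor(value):
            print(f"{key:20s} -> shape={tuple(value.shape)} dtype={value.dtype} device={value.device}")
            if value.numel() == 0:
                print(f"  !! empty tensor for {key}")
        elif isinstance(value, (list, tuple)):
            print(f"{key:20s} -> {type(value).__name__} of length {len(value)}")
        else:
            print(f"{key:20s} -> {type(value).__name__}")

# e.g. call inspect_batch(inputs) at the top of compute_loss in finetune/trainer.py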

if you want, i can share the short checklist we use for No.2 + No.14 so you can tick through in a few minutes. just say the word and i’ll drop it.

onestardao avatar Aug 27 '25 04:08 onestardao

Same error here — has it been solved?

zyandtom avatar Sep 02 '25 11:09 zyandtom

Solved. This happens because the temporal_ids generated by the image_processor are never passed in. You need to generate them manually in the dataset, or take the image_processor's output directly and put it into the collator. For how they are generated, see https://github.com/hiyouga/LLaMA-Factory/blob/59f2bf1ea369ca91774b99e8d94a578657be6c7c/src/llamafactory/data/mm_plugin.py#L951
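A rough sketch of that pass-through (the dict layout and the pixel_values / temporal_ids key names are assumptions based on this thread, not code taken from the repo's finetune scripts):

# sketch, not the repo's code: keep temporal_ids from the image processor output
# in each example, then batch them in the collator so get_vllm_embedding sees them
def build_example(processor_output, text_fields):
    example = dict(text_fields)
    example["pixel_values"] = processor_output["pixel_values"]
    # assumed key: per this thread, the image processor emits temporal_ids for video frames
    example["temporal_ids"] = processor_output.get("temporal_ids")
    return example

def data_collator(examples):
    batch = {
        "pixel_values": [ex["pixel_values"] for ex in examples],
        # one entry per sample; image-only samples can simply carry None
        "temporal_ids": [ex.get("temporal_ids") for ex in examples],
        # ... collate input_ids, labels, tgt_sizes, etc. exactly as before ...
    }
    return batch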

zyandtom avatar Sep 03 '25 11:09 zyandtom

Do you have concrete code for this fix? @zyandtom

Xuanaxx avatar Sep 05 '25 07:09 Xuanaxx

Just add a temporal_ids field in the collator and set it to None. If you actually need to pass real temporal IDs, refer to the official demo at https://huggingface.co/openbmb/MiniCPM-V-4_5
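In code, the minimal version of that workaround looks roughly like this (a sketch: original_collator stands in for whatever collate function finetune.py already uses, and the temporal_ids key name follows the resampler call in the traceback):

# minimal sketch: make sure the batch that reaches model.forward() always has a
# temporal_ids entry, defaulting to None per sample for image-only training
def data_collator(examples):
    batch = original_collator(examples)  # your existing collate logic
    if "temporal_ids" not in batch:
        batch["temporal_ids"] = [None] * len(examples)
    return batch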

zyandtom avatar Sep 05 '25 07:09 zyandtom