Exception during inference with Qwen2VL and Qwen2.5VL: assert module.weight.shape[1] == 1
System Info
transformers version: 4.52.3
Platform: Linux-5.10.0-1029-oem-x86_64-with-glibc2.31
GPU device: Quadro RTX 8000
Python version: 3.10
Huggingface_hub version: 0.32.2
Safetensors version: 0.5.3
Accelerate version: 0.34.2
PyTorch version (GPU?): 2.5.0+cu124
Using distributed or parallel set-up in script?: No
Who can help?
@zucchini-nlp @qubvel @ArthurZucker
Information
- [ ] The official example scripts
- [x] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction
I followed these tutorials:
- https://colab.research.google.com/github/huggingface/cookbook/blob/main/notebooks/en/fine_tuning_vlm_trl.ipynb
- https://github.com/QwenLM/Qwen2.5-VL/blob/d2240f11656bfe404b9ba56db4e51cd09f522ff1/qwen-vl-finetune/qwenvl/train/train_qwen.py
Steps to reproduce the issue:
- Fine-tune a Qwen2VL or Qwen2.5VL model (e.g. "Qwen/Qwen2.5-VL-3B-Instruct") on a custom dataset (QLoRA and LoRA enabled, on CUDA).
- Run inference on a video (on CUDA); a hedged sketch of this step is shown below.
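A minimal inference sketch that triggers the assert on my side looks roughly like this. The checkpoint path, video path, and exact quantization settings are placeholders/assumptions based on my training setup, not the full script:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

# Placeholder paths: substitute the fine-tuned checkpoint folder and a local video.
ckpt = "/path/to/model_qwen2vl_video-lora"
video = "/path/to/clip.mp4"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    ckpt,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(ckpt)

conversation = [{"role": "user", "content": [
    {"type": "video", "path": video, "fps": 1.0},
    {"type": "text", "text": "Describe the video."},
]}]
inputs = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

# The bitsandbytes AssertionError below is raised inside this call.
output_ids = model.generate(**inputs, max_new_tokens=100)
```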
Full log and exception:
- This IS expected if you are initializing Qwen2_5_VLForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Qwen2_5_VLForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Qwen2_5_VLForConditionalGeneration were not initialized from the model checkpoint at /home/user/Desktop/demo/tmp/weights_2025-05-30_13.06.42.192256_qwen_qwen2.5-vl-3b-instruct_b2_e1_vf16_fps1.0/model_qwen2vl_video-lora and are newly initialized: ['model.language_model.layers.0.self_attn.k_proj.bias', 'model.language_model.layers.0.self_attn.k_proj.weight', 'model.language_model.layers.0.self_attn.o_proj.weight', 'model.language_model.layers.0.self_attn.q_proj.bias', 'model.language_model.layers.0.self_attn.q_proj.weight', 'model.language_model.layers.0.self_attn.v_proj.bias', 'model.language_model.layers.0.self_attn.v_proj.weight', 'model.language_model.layers.1.self_attn.k_proj.bias', 'model.language_model.layers.1.self_attn.k_proj.weight', 'model.language_model.layers.1.self_attn.o_proj.weight', 'model.language_model.layers.1.self_attn.q_proj.bias', 'model.language_model.layers.1.self_attn.q_proj.weight', 'model.language_model.layers.1.self_attn.v_proj.bias', 'model.language_model.layers.1.self_attn.v_proj.weight', 'model.language_model.layers.10.self_attn.k_proj.bias', 'model.language_model.layers.10.self_attn.k_proj.weight', 'model.language_model.layers.10.self_attn.o_proj.weight', 'model.language_model.layers.10.self_attn.q_proj.bias', 'model.language_model.layers.10.self_attn.q_proj.weight', 'model.language_model.layers.10.self_attn.v_proj.bias', 'model.language_model.layers.10.self_attn.v_proj.weight', 'model.language_model.layers.11.self_attn.k_proj.bias', 'model.language_model.layers.11.self_attn.k_proj.weight', 'model.language_model.layers.11.self_attn.o_proj.weight', 'model.language_model.layers.11.self_attn.q_proj.bias', 'model.language_model.layers.11.self_attn.q_proj.weight', 'model.language_model.layers.11.self_attn.v_proj.bias', 'model.language_model.layers.11.self_attn.v_proj.weight', 'model.language_model.layers.12.self_attn.k_proj.bias', 'model.language_model.layers.12.self_attn.k_proj.weight', 'model.language_model.layers.12.self_attn.o_proj.weight', 'model.language_model.layers.12.self_attn.q_proj.bias', 'model.language_model.layers.12.self_attn.q_proj.weight', 'model.language_model.layers.12.self_attn.v_proj.bias', 'model.language_model.layers.12.self_attn.v_proj.weight', 'model.language_model.layers.13.self_attn.k_proj.bias', 'model.language_model.layers.13.self_attn.k_proj.weight', 'model.language_model.layers.13.self_attn.o_proj.weight', 'model.language_model.layers.13.self_attn.q_proj.bias', 'model.language_model.layers.13.self_attn.q_proj.weight', 'model.language_model.layers.13.self_attn.v_proj.bias', 'model.language_model.layers.13.self_attn.v_proj.weight', 'model.language_model.layers.14.self_attn.k_proj.bias', 'model.language_model.layers.14.self_attn.k_proj.weight', 'model.language_model.layers.14.self_attn.o_proj.weight', 'model.language_model.layers.14.self_attn.q_proj.bias', 'model.language_model.layers.14.self_attn.q_proj.weight', 'model.language_model.layers.14.self_attn.v_proj.bias', 'model.language_model.layers.14.self_attn.v_proj.weight', 'model.language_model.layers.15.self_attn.k_proj.bias', 'model.language_model.layers.15.self_attn.k_proj.weight', 'model.language_model.layers.15.self_attn.o_proj.weight', 'model.language_model.layers.15.self_attn.q_proj.bias', 'model.language_model.layers.15.self_attn.q_proj.weight', 'model.language_model.layers.15.self_attn.v_proj.bias', 'model.language_model.layers.15.self_attn.v_proj.weight', 'model.language_model.layers.16.self_attn.k_proj.bias', 
'model.language_model.layers.16.self_attn.k_proj.weight', 'model.language_model.layers.16.self_attn.o_proj.weight', 'model.language_model.layers.16.self_attn.q_proj.bias', 'model.language_model.layers.16.self_attn.q_proj.weight', 'model.language_model.layers.16.self_attn.v_proj.bias', 'model.language_model.layers.16.self_attn.v_proj.weight', 'model.language_model.layers.17.self_attn.k_proj.bias', 'model.language_model.layers.17.self_attn.k_proj.weight', 'model.language_model.layers.17.self_attn.o_proj.weight', 'model.language_model.layers.17.self_attn.q_proj.bias', 'model.language_model.layers.17.self_attn.q_proj.weight', 'model.language_model.layers.17.self_attn.v_proj.bias', 'model.language_model.layers.17.self_attn.v_proj.weight', 'model.language_model.layers.18.self_attn.k_proj.bias', 'model.language_model.layers.18.self_attn.k_proj.weight', 'model.language_model.layers.18.self_attn.o_proj.weight', 'model.language_model.layers.18.self_attn.q_proj.bias', 'model.language_model.layers.18.self_attn.q_proj.weight', 'model.language_model.layers.18.self_attn.v_proj.bias', 'model.language_model.layers.18.self_attn.v_proj.weight', 'model.language_model.layers.19.self_attn.k_proj.bias', 'model.language_model.layers.19.self_attn.k_proj.weight', 'model.language_model.layers.19.self_attn.o_proj.weight', 'model.language_model.layers.19.self_attn.q_proj.bias', 'model.language_model.layers.19.self_attn.q_proj.weight', 'model.language_model.layers.19.self_attn.v_proj.bias', 'model.language_model.layers.19.self_attn.v_proj.weight', 'model.language_model.layers.2.self_attn.k_proj.bias', 'model.language_model.layers.2.self_attn.k_proj.weight', 'model.language_model.layers.2.self_attn.o_proj.weight', 'model.language_model.layers.2.self_attn.q_proj.bias', 'model.language_model.layers.2.self_attn.q_proj.weight', 'model.language_model.layers.2.self_attn.v_proj.bias', 'model.language_model.layers.2.self_attn.v_proj.weight', 'model.language_model.layers.20.self_attn.k_proj.bias', 'model.language_model.layers.20.self_attn.k_proj.weight', 'model.language_model.layers.20.self_attn.o_proj.weight', 'model.language_model.layers.20.self_attn.q_proj.bias', 'model.language_model.layers.20.self_attn.q_proj.weight', 'model.language_model.layers.20.self_attn.v_proj.bias', 'model.language_model.layers.20.self_attn.v_proj.weight', 'model.language_model.layers.21.self_attn.k_proj.bias', 'model.language_model.layers.21.self_attn.k_proj.weight', 'model.language_model.layers.21.self_attn.o_proj.weight', 'model.language_model.layers.21.self_attn.q_proj.bias', 'model.language_model.layers.21.self_attn.q_proj.weight', 'model.language_model.layers.21.self_attn.v_proj.bias', 'model.language_model.layers.21.self_attn.v_proj.weight', 'model.language_model.layers.22.self_attn.k_proj.bias', 'model.language_model.layers.22.self_attn.k_proj.weight', 'model.language_model.layers.22.self_attn.o_proj.weight', 'model.language_model.layers.22.self_attn.q_proj.bias', 'model.language_model.layers.22.self_attn.q_proj.weight', 'model.language_model.layers.22.self_attn.v_proj.bias', 'model.language_model.layers.22.self_attn.v_proj.weight', 'model.language_model.layers.23.self_attn.k_proj.bias', 'model.language_model.layers.23.self_attn.k_proj.weight', 'model.language_model.layers.23.self_attn.o_proj.weight', 'model.language_model.layers.23.self_attn.q_proj.bias', 'model.language_model.layers.23.self_attn.q_proj.weight', 'model.language_model.layers.23.self_attn.v_proj.bias', 'model.language_model.layers.23.self_attn.v_proj.weight', 
'model.language_model.layers.24.self_attn.k_proj.bias', 'model.language_model.layers.24.self_attn.k_proj.weight', 'model.language_model.layers.24.self_attn.o_proj.weight', 'model.language_model.layers.24.self_attn.q_proj.bias', 'model.language_model.layers.24.self_attn.q_proj.weight', 'model.language_model.layers.24.self_attn.v_proj.bias', 'model.language_model.layers.24.self_attn.v_proj.weight', 'model.language_model.layers.25.self_attn.k_proj.bias', 'model.language_model.layers.25.self_attn.k_proj.weight', 'model.language_model.layers.25.self_attn.o_proj.weight', 'model.language_model.layers.25.self_attn.q_proj.bias', 'model.language_model.layers.25.self_attn.q_proj.weight', 'model.language_model.layers.25.self_attn.v_proj.bias', 'model.language_model.layers.25.self_attn.v_proj.weight', 'model.language_model.layers.26.self_attn.k_proj.bias', 'model.language_model.layers.26.self_attn.k_proj.weight', 'model.language_model.layers.26.self_attn.o_proj.weight', 'model.language_model.layers.26.self_attn.q_proj.bias', 'model.language_model.layers.26.self_attn.q_proj.weight', 'model.language_model.layers.26.self_attn.v_proj.bias', 'model.language_model.layers.26.self_attn.v_proj.weight', 'model.language_model.layers.27.self_attn.k_proj.bias', 'model.language_model.layers.27.self_attn.k_proj.weight', 'model.language_model.layers.27.self_attn.o_proj.weight', 'model.language_model.layers.27.self_attn.q_proj.bias', 'model.language_model.layers.27.self_attn.q_proj.weight', 'model.language_model.layers.27.self_attn.v_proj.bias', 'model.language_model.layers.27.self_attn.v_proj.weight', 'model.language_model.layers.28.self_attn.k_proj.bias', 'model.language_model.layers.28.self_attn.k_proj.weight', 'model.language_model.layers.28.self_attn.o_proj.weight', 'model.language_model.layers.28.self_attn.q_proj.bias', 'model.language_model.layers.28.self_attn.q_proj.weight', 'model.language_model.layers.28.self_attn.v_proj.bias', 'model.language_model.layers.28.self_attn.v_proj.weight', 'model.language_model.layers.29.self_attn.k_proj.bias', 'model.language_model.layers.29.self_attn.k_proj.weight', 'model.language_model.layers.29.self_attn.o_proj.weight', 'model.language_model.layers.29.self_attn.q_proj.bias', 'model.language_model.layers.29.self_attn.q_proj.weight', 'model.language_model.layers.29.self_attn.v_proj.bias', 'model.language_model.layers.29.self_attn.v_proj.weight', 'model.language_model.layers.3.self_attn.k_proj.bias', 'model.language_model.layers.3.self_attn.k_proj.weight', 'model.language_model.layers.3.self_attn.o_proj.weight', 'model.language_model.layers.3.self_attn.q_proj.bias', 'model.language_model.layers.3.self_attn.q_proj.weight', 'model.language_model.layers.3.self_attn.v_proj.bias', 'model.language_model.layers.3.self_attn.v_proj.weight', 'model.language_model.layers.30.self_attn.k_proj.bias', 'model.language_model.layers.30.self_attn.k_proj.weight', 'model.language_model.layers.30.self_attn.o_proj.weight', 'model.language_model.layers.30.self_attn.q_proj.bias', 'model.language_model.layers.30.self_attn.q_proj.weight', 'model.language_model.layers.30.self_attn.v_proj.bias', 'model.language_model.layers.30.self_attn.v_proj.weight', 'model.language_model.layers.31.self_attn.k_proj.bias', 'model.language_model.layers.31.self_attn.k_proj.weight', 'model.language_model.layers.31.self_attn.o_proj.weight', 'model.language_model.layers.31.self_attn.q_proj.bias', 'model.language_model.layers.31.self_attn.q_proj.weight', 'model.language_model.layers.31.self_attn.v_proj.bias', 
'model.language_model.layers.31.self_attn.v_proj.weight', 'model.language_model.layers.32.self_attn.k_proj.bias', 'model.language_model.layers.32.self_attn.k_proj.weight', 'model.language_model.layers.32.self_attn.o_proj.weight', 'model.language_model.layers.32.self_attn.q_proj.bias', 'model.language_model.layers.32.self_attn.q_proj.weight', 'model.language_model.layers.32.self_attn.v_proj.bias', 'model.language_model.layers.32.self_attn.v_proj.weight', 'model.language_model.layers.33.self_attn.k_proj.bias', 'model.language_model.layers.33.self_attn.k_proj.weight', 'model.language_model.layers.33.self_attn.o_proj.weight', 'model.language_model.layers.33.self_attn.q_proj.bias', 'model.language_model.layers.33.self_attn.q_proj.weight', 'model.language_model.layers.33.self_attn.v_proj.bias', 'model.language_model.layers.33.self_attn.v_proj.weight', 'model.language_model.layers.34.self_attn.k_proj.bias', 'model.language_model.layers.34.self_attn.k_proj.weight', 'model.language_model.layers.34.self_attn.o_proj.weight', 'model.language_model.layers.34.self_attn.q_proj.bias', 'model.language_model.layers.34.self_attn.q_proj.weight', 'model.language_model.layers.34.self_attn.v_proj.bias', 'model.language_model.layers.34.self_attn.v_proj.weight', 'model.language_model.layers.35.self_attn.k_proj.bias', 'model.language_model.layers.35.self_attn.k_proj.weight', 'model.language_model.layers.35.self_attn.o_proj.weight', 'model.language_model.layers.35.self_attn.q_proj.bias', 'model.language_model.layers.35.self_attn.q_proj.weight', 'model.language_model.layers.35.self_attn.v_proj.bias', 'model.language_model.layers.35.self_attn.v_proj.weight', 'model.language_model.layers.4.self_attn.k_proj.bias', 'model.language_model.layers.4.self_attn.k_proj.weight', 'model.language_model.layers.4.self_attn.o_proj.weight', 'model.language_model.layers.4.self_attn.q_proj.bias', 'model.language_model.layers.4.self_attn.q_proj.weight', 'model.language_model.layers.4.self_attn.v_proj.bias', 'model.language_model.layers.4.self_attn.v_proj.weight', 'model.language_model.layers.5.self_attn.k_proj.bias', 'model.language_model.layers.5.self_attn.k_proj.weight', 'model.language_model.layers.5.self_attn.o_proj.weight', 'model.language_model.layers.5.self_attn.q_proj.bias', 'model.language_model.layers.5.self_attn.q_proj.weight', 'model.language_model.layers.5.self_attn.v_proj.bias', 'model.language_model.layers.5.self_attn.v_proj.weight', 'model.language_model.layers.6.self_attn.k_proj.bias', 'model.language_model.layers.6.self_attn.k_proj.weight', 'model.language_model.layers.6.self_attn.o_proj.weight', 'model.language_model.layers.6.self_attn.q_proj.bias', 'model.language_model.layers.6.self_attn.q_proj.weight', 'model.language_model.layers.6.self_attn.v_proj.bias', 'model.language_model.layers.6.self_attn.v_proj.weight', 'model.language_model.layers.7.self_attn.k_proj.bias', 'model.language_model.layers.7.self_attn.k_proj.weight', 'model.language_model.layers.7.self_attn.o_proj.weight', 'model.language_model.layers.7.self_attn.q_proj.bias', 'model.language_model.layers.7.self_attn.q_proj.weight', 'model.language_model.layers.7.self_attn.v_proj.bias', 'model.language_model.layers.7.self_attn.v_proj.weight', 'model.language_model.layers.8.self_attn.k_proj.bias', 'model.language_model.layers.8.self_attn.k_proj.weight', 'model.language_model.layers.8.self_attn.o_proj.weight', 'model.language_model.layers.8.self_attn.q_proj.bias', 'model.language_model.layers.8.self_attn.q_proj.weight', 
'model.language_model.layers.8.self_attn.v_proj.bias', 'model.language_model.layers.8.self_attn.v_proj.weight', 'model.language_model.layers.9.self_attn.k_proj.bias', 'model.language_model.layers.9.self_attn.k_proj.weight', 'model.language_model.layers.9.self_attn.o_proj.weight', 'model.language_model.layers.9.self_attn.q_proj.bias', 'model.language_model.layers.9.self_attn.q_proj.weight', 'model.language_model.layers.9.self_attn.v_proj.bias', 'model.language_model.layers.9.self_attn.v_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
0%| | 0/111 [00:00<?, ?it/s]Unused or unrecognized kwargs: fps, return_tensors.
/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/bitsandbytes/nn/modules.py:354: UserWarning: FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first.
warnings.warn(
0%| | 0/111 [00:16<?, ?it/s]
Traceback (most recent call last):
File "/home/user/Desktop/demo/hf_qwen_demo_video.py", line 356, in <module>
eval_videos(
File "/home/user/Desktop/demo/hf_qwen_demo_video.py", line 212, in eval_videos
pred_caption_list = run_model_preds(
File "/home/user/Desktop/demo/hf_qwen_demo_video.py", line 180, in run_model_preds
output_text = run_model_single_inference(
File "/home/user/Desktop/demo/hf_qwen_demo_video.py", line 116, in run_model_single_inference
output_ids = model.generate(**inputs, max_new_tokens=max_token_length)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/transformers/generation/utils.py", line 2597, in generate
result = self._sample(
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/transformers/generation/utils.py", line 3557, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/transformers/utils/generic.py", line 969, in wrapper
output = func(self, *args, **kwargs)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1908, in forward
outputs = self.model(
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1728, in forward
outputs = self.language_model(
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1191, in forward
layer_outputs = decoder_layer(
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1053, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 938, in forward
query_states = self.q_proj(hidden_states)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 468, in forward
fix_4bit_weight_quant_state_from_module(self)
File "/home/szhou/anaconda3/envs/my_project/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 360, in fix_4bit_weight_quant_state_from_module
assert module.weight.shape[1] == 1
AssertionError
Process finished with exit code 1
Expected behavior
I expect the inference to complete without errors.
@iglaweb we recently fixed one bug when saving Qwen-VL models and it is in the latest patch release. Can you try to update the transformers version?
Prob you'll need to re-save the checkpoint, I will help with that if you can share it on the hub
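A minimal re-save sketch, assuming the patched release can already load the old checkpoint and only the serialized files need refreshing (paths are placeholders; if the keys don't load cleanly, the checkpoint would need manual handling instead):

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

old_path = "/path/to/old_checkpoint"      # placeholder: checkpoint saved with the buggy version
new_path = "/path/to/resaved_checkpoint"  # placeholder: where to write the refreshed files

# Load with the patched transformers, then write the checkpoint back out.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(old_path)
processor = AutoProcessor.from_pretrained(old_path)
model.save_pretrained(new_path)
processor.save_pretrained(new_path)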
hey @zucchini-nlp I trained a Qwen2.5-VL-3B model a month ago, and now I get this error when attempting to load the adapters:
Unrecognized video processor in /content/lora_model_qwen2.5-VL-3B/lora_model_qwen2.5-VL-3B. Should have a `video_processor_type` key in its video_preprocessor_config.json of config.json, or one of the following `model_type` keys in its config.json: instructblip, instructblipvideo, internvl, llava_next_video, llava_onevision, qwen2_5_omni, qwen2_5_vl, qwen2_vl, smolvlm, video_llava
Can you please guide me on how I should tweak the adapter config? I hope I don't have to re-train.
I've just seen that transformers has added VideoProcessors as first-class processors. Could that be related @zucchini-nlp @danielhanchen?
@msciancalepore98 can you share your tuned weights and a minimal repro please?
I'm just using the official unsloth notebook with no changes. I executed the cell below that loads the model from a path.
```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    model_name = "lora_model",  # YOUR MODEL YOU USED FOR TRAINING
    load_in_4bit = True,        # Set to False for 16bit LoRA
)
FastVisionModel.for_inference(model)  # Enable for inference!
```
Here is a zip with the configs; unfortunately I cannot share the weights: lora_qwen2.5VL-3B-configs.zip
Hm, I think I found it. Can you check if this (https://github.com/huggingface/transformers/pull/38840) resolves your issue?
I see that the shared zip file has no config, and thus we weren't able to infer the model type. Now we check for the image processor's type as well.
@zucchini-nlp I've tried to install it in the notebook at the top, but when I execute the model loading cell I have:
ImportError: cannot import name 'HybridCache' from 'transformers.models.gemma3.modeling_gemma3' (/usr/local/lib/python3.11/dist-packages/transformers/models/gemma3/modeling_gemma3.py)
installed with: !pip install -U "git+https://github.com/zucchini-nlp/transformers.git@video_processor"
Are you importing HybridCache anywhere, or is that inside unsloth? Probably the version of unsloth isn't compatible with transformers from the main branch. We recently removed HybridCache from gemma, and it should be imported only from transformers.cache_utils.
In that case, even if the issue is fixed on the main branch, you might have to add a few hacks to make the current version compatible with unsloth, or wait for the unsloth team to bump to the next transformers release.
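For reference, one possible hack along those lines, assuming the ImportError is raised by unsloth importing HybridCache from the gemma3 module; this is only a sketch, not an endorsed fix:

```python
# Re-expose HybridCache under the old gemma3 location before unsloth imports it.
# HybridCache itself still lives in transformers.cache_utils.
import transformers.models.gemma3.modeling_gemma3 as gemma3_modeling
from transformers.cache_utils import HybridCache

gemma3_modeling.HybridCache = HybridCache

from unsloth import FastVisionModel  # import unsloth only after the patch
```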
Nope, I'm not using that anywhere in the notebook. I'd say it's on the unsloth side, but I'm not sure.
I've created an issue on unsloth as well to track it.
Thanks for your time!
> @iglaweb we recently fixed one bug when saving Qwen-VL models and it is in the latest patch release. Can you try to update the transformers version?
> Prob you'll need to re-save the checkpoint, I will help with that if you can share it on the hub
@zucchini-nlp Thank you for the reply. I updated the transformers library to 4.52.4 and then fine-tuned "Qwen/Qwen2.5-VL-3B-Instruct" on a small portion of my dataset. Hm, it still does not work. I tested the model in the two scenarios below.
Scenario 1 (training-free). I implemented a script that loads the pre-trained model, saves it with model.save_pretrained, and loads it again from the saved folder (Qwen2_5_VLForConditionalGeneration.from_pretrained); it works without any errors.
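For completeness, the training-free round-trip in Scenario 1 is essentially the following (the local folder name is a placeholder):

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

model.save_pretrained("tmp_roundtrip")        # save the pre-trained model ...
processor.save_pretrained("tmp_roundtrip")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("tmp_roundtrip")  # ... and load it again: no errors
```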
Scenario 2 (fine-tune). This is the issue I want to fix. I train a model using SFTTrainer with LoraConfig and BitsAndBytesConfig as follows:
```python
from trl import SFTConfig, SFTTrainer, get_kbit_device_map
from transformers import Qwen2VLProcessor, Qwen2_5_VLProcessor, AutoModelForVision2Seq
from transformers import AutoProcessor, BitsAndBytesConfig
from peft import LoraConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_kwargs = dict(
    revision='main',
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map=get_kbit_device_map(),
    quantization_config=bnb_config,
)

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
model = AutoModelForVision2Seq.from_pretrained(model_id, **model_kwargs)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=prepared_train_dataset,
    eval_dataset=prepared_val_dataset,
    data_collator=collate_fn,
    peft_config=peft_config,
    processing_class=processor.tokenizer,
)

model.save_pretrained(f"{out_dir}/model_qwen2")
processor.save_pretrained(f"{out_dir}/proc_qwen2")
```
And then I get the same error when running inference on the fine-tuned model. The model is loaded through Qwen2_5_VLForConditionalGeneration.from_pretrained.
The full log with the error is here: https://gist.github.com/iglaweb/87d8f29a02f0d2032566206711b8a9bb
In my case, downgrading to transformers 4.51.3 fixes this (that was the transformers version I used along with unsloth ~1 month ago when training).
Hm, I will take a look and train on a small dummy dataset; maybe we need to patch something in the trainer 🙃
@iglaweb the script below works with the latest transformers installed from main if we let the trainer save the checkpoint. But I am not sure if you are saving the model manually at the end with model.save_pretrained().
With model.save_pretrained() and loading back I get a state dict mismatch error (expected, since you saved the model manually; this happens with any other model too), but no error about the shape. Can you provide a minimal reproducer with a dummy dataset?
```python
from trl import SFTConfig, SFTTrainer, get_kbit_device_map
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
import torch
import random
from datasets import load_dataset

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_kwargs = dict(
    revision='main',
    torch_dtype=torch.float16,
    device_map=get_kbit_device_map(),
    quantization_config=bnb_config,
)

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, **model_kwargs)

dataset = load_dataset("HuggingFaceH4/llava-instruct-mix-vsft", split="test")
dataset = dataset.select([0, 1])

def collate_fn(examples):
    texts = [processor.apply_chat_template(message, tokenize=False) for message in examples["messages"]]
    batch = processor(text=texts, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch

dataset = dataset.map(collate_fn, batched=True)

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

trainer = SFTTrainer(
    model=model,
    args=TrainingArguments(
        max_steps=2,
        per_device_train_batch_size=2,
        output_dir="temp_qwen",
        save_strategy="steps",
        save_steps=1,
    ),
    train_dataset=dataset,
    peft_config=peft_config,
    processing_class=processor.tokenizer,
)

trainer.train()  # saves checkpoints every step
# model.save_pretrained(f"temp_qwen")
# processor.save_pretrained(f"temp_qwen")

# Load back from the first saved checkpoint
model = AutoModelForImageTextToText.from_pretrained("temp_qwen/checkpoint-1")
```
@zucchini-nlp Thanks a lot for the snippet. Once I changed model.save_pretrained() to trainer.save_model, together with args.output_dir for training, I was able to run inference successfully on the fine-tuned model. But then the model output for any video used in the prompt is the following:
> I'm sorry, but I cannot answer your question as there is no video attached to this prompt. Please provide me with the necessary information so that I can assist you better.
So this is the answer I always get from my fine-tuned model. If I use just the pre-trained model with AutoModelForImageTextToText.from_pretrained, the answer is fine, with no complaint about a missing video. The videos do exist on disk.
I used the following code for a video inference:
```python
def run_model_single_inference(model, processor, video_path, llm_prompt, num_frames=8, max_token_length=100):
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "path": video_path,
                    "max_pixels": 360 * 420,
                    "fps": 1.0
                },
                {"type": "text", "text": llm_prompt},
            ],
        }
    ]
    # Chat template will load the image/video for you and return inputs in torch.Tensor
    inputs = processor.apply_chat_template(
        conversation,
        # how many num_frames to sample from video, otherwise the whole video will be loaded
        num_frames=num_frames,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt"
    ).to(model.device)  # , dtype=model.dtype)
    output_ids = model.generate(**inputs, max_new_tokens=max_token_length)
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    if isinstance(output_text, list):
        output_text = output_text[0]
    return output_text.strip()
```
The following code is being used for training data preparation:
```python
def prepare_dataset(example, llm_text_prompt, max_pixels, fps=1.0) -> dict[str, list[dict[str, Any]]]:
    video_path = example["short_clip_path"]
    answer = example["caption"]
    system_message = "You are an expert in video analysis."
    messages = [
        {"role": "system", "content": [{"type": "text", "text": system_message}]},
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path, "max_pixels": max_pixels, "fps": float(fps)},
                {"type": "text", "text": f"Question: {llm_text_prompt}"},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": answer}]},
    ]
    return {"messages": messages}
```
> With model.save_pretrained() and loading back I get a state dict mismatch error (expected, since you saved the model manually; this happens with any other model too), but no error about the shape.
Hm, why should I expect to get a state dict mismatch error? Could you explain the purpose of each method and when I should use each of them? Which one is best if I want to load the fine-tuned weights?
- model.save_pretrained()
- trainer.save_model()
- trainer.model.save_pretrained()
- training_args.output_dir
model.save_pretrained() works without any errors for Llava-Next-Video.
I fine-tuned a model with peft_config=peft_config, but then the following code does not work. Is that normal?
```python
import os
from peft import PeftModel
from transformers import AutoModel

adapter_path = 'temp_qwen/adapter_config.json'
model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
if os.path.exists(adapter_path):
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    model = PeftModel.from_pretrained(
        model,
        'temp_qwen',
        device_map="auto",
        trust_remote_code=True).eval().cuda()
```
Hmm, I don't really know how SFT uses your input messages; probably it expects video inputs as a separate key in the dataset. I suggest looking at how the example scripts handle inputs here. In any case, it seems like the training went wrong.
> Hm, why should I expect to get a state dict mismatch error?
Using PEFT changes the model's state dict, and when you just save the model, it will be saved with an extra base_model. prefix in the state dict. Using the trainer to save will remove the extra keys and hooks added on top of the model and save it correctly for you. In general, when training you don't have to save the model yourself; the trainer saves the outputs every K steps as indicated in the training args. For all other cases model.save_pretrained() works perfectly fine.
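A minimal sketch of the reload path described above, assuming the adapter was saved by the trainer into a checkpoint folder (the folder name is taken from the earlier example; merge_and_unload is optional and only needed if you want a plain, adapter-free model):

```python
from peft import PeftModel
from transformers import AutoModelForImageTextToText

base_id = "Qwen/Qwen2.5-VL-3B-Instruct"
out_dir = "temp_qwen/checkpoint-1"   # adapter folder written by the trainer

base = AutoModelForImageTextToText.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, out_dir)   # attaches the LoRA adapter to the base model
model = model.merge_and_unload()                   # optional: fold the adapter into the base weights
```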
@zucchini-nlp Thanks a lot for the quick response. As you mentioned, the problem was in the message template: I used "path": video_path, instead of "video": video_path. Training and inference now work. But...
There is one more issue. Once I fine-tuned the model and saved it with trainer.save_model('./saved_model'), I'm not able to load the correct processor through processor = AutoProcessor.from_pretrained('./saved_model'). For some reason it is a Qwen2TokenizerFast instead of a Qwen2_5_VLProcessor (I get an error that the images and videos args are not found). So I have to use the default name explicitly: processor = AutoProcessor.from_pretrained('Qwen/Qwen2.5-VL-3B-Instruct').
So why does this happen? Should I use processor.save_pretrained in this case?
Is it safe to use the default Hugging Face processor id after training? Does the Trainer update the tokenizer?
Tbh, I don't really know if the trainers can currently save the processor. I see that in the Trainer you passed processing_class=processor.tokenizer, so the trainer has no way to save the processor without access to it.
My question is: if we pass the processor as the processing class, does that fail during training (I believe it shouldn't)? And yeah, using the official processor will also work, since we didn't change it when tuning :)
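A sketch of the two options, reusing identifiers from the earlier snippets (training_args, dataset, peft_config, processor, and out_dir are placeholders here, not a verified recipe):

```python
# Option 1: pass the full processor so the trainer can save it alongside the model.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
    processing_class=processor,   # instead of processor.tokenizer
)

# Option 2: keep processing_class=processor.tokenizer and save the processor yourself.
processor.save_pretrained(out_dir)
```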
@zucchini-nlp
> My question is: if we pass the processor as the processing class, does that fail during training (I believe it shouldn't)? And yeah, using the official processor will also work, since we didn't change it when tuning :)
Thank you. Training is not affected now; it runs successfully.
I just noticed a new error raised while running inference on a fine-tuned model (from "Qwen/Qwen2.5-VL-3B-Instruct"). It fails on the same input every time; the prompts are very similar (only the video input differs slightly).
C:\actions-runner\_work\pytorch\pytorch\pytorch\aten\src\ATen\native\cuda\TensorCompare.cu:110: block: [0,0,0], thread: [0,0,0] Assertion `input[0] != 0` failed.
21%|██ | 23/111 [02:28<09:29, 6.48s/it]
Traceback (most recent call last):
File "hf_qwen_demo_video.py", line 420, in <module>
eval_videos(
File "hf_qwen_demo_video.py", line 208, in eval_videos
pred_caption_list = run_model_preds(
File "hf_qwen_demo_video.py", line 174, in run_model_preds
output_text = run_model_single_inference(
File "hf_qwen_demo_video.py", line 104, in run_model_single_inference
generated_ids = model.generate(**inputs, max_new_tokens=100)
File "C:\Users\User\.conda\envs\hf_qwen_proj\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "C:\Users\User\.conda\envs\hf_qwen_proj\lib\site-packages\transformers\generation\utils.py", line 2597, in generate
result = self._sample(
File "C:\Users\User\.conda\envs\hf_qwen_proj\lib\site-packages\transformers\generation\utils.py", line 3602, in _sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Any suggestions on how to avoid it? Let me know what additional logs may help here.
UPD: Adding do_sample=False to model.generate eliminates the exception.
It might be some over/underflow issue if the final logits have NaN. Can you try changing the torch_dtype and the attn_implementation when loading the model?
I remember Qwen had some issues with specific combinations of those, for example in https://github.com/huggingface/transformers/issues/33294
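For example, something along these lines when loading (which dtype/attention combination helps, if any, depends on the device and is not guaranteed):

```python
import torch
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.bfloat16,     # try float16 / bfloat16 / float32
    attn_implementation="sdpa",     # or "eager" / "flash_attention_2"
    device_map="auto",
)
```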
@zucchini-nlp I switched to attn_implementation='flash_attention_2' and then got the following error:
AttributeError: 'Qwen2_5_VLVisionAttention' object has no attribute 'is_causal'
But it looks like that is a known issue in 4.53.0: https://github.com/huggingface/transformers/issues/39095
I can't use FA2 with 4.52.4 due to "RuntimeError: FlashAttention only supports Ampere GPUs or newer." (I have an RTX 8000).
I did tests with v4.52.4 and v4.53.0. The Qwen2.5-VL model was trained with LoRA only:
- torch_dtype=torch.float16, attn_implementation='sdpa' produces "!!!" output in all cases
- torch_dtype=torch.bfloat16, attn_implementation='sdpa' leads to CUDA out of memory
- torch_dtype=torch.float32, attn_implementation='sdpa' produces "!!!" output in all cases
- attn_implementation='eager' leads to CUDA out of memory
- do_sample=False: no issue
- do_sample=True raises the exception: RuntimeError: CUDA error: device-side assert triggered (Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.)
A model trained with LoRA and QLoRA does not hit the RuntimeError: CUDA error: device-side assert triggered and produces a valid output/answer. Checked with v4.52.4.
It also works with both do_sample=False and do_sample=True. Tested with sdpa only.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.