Prompt tuning for VLMs like Qwen2.5VL

Open Martin9797 opened this issue 3 weeks ago • 11 comments

Hi, I have been experimenting with prompt tuning Qwen2.5VL-7B-Instruct using the training setup from this page: https://huggingface.co/learn/cookbook/fine_tuning_vlm_trl and the following prompt tuning config:

prompt_tune_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.RANDOM,
    num_virtual_tokens=20,
    tokenizer_name_or_path=model_name,
)

as the "peft_config" parameter for the SFTTrainer.

When prompt-tuning a model I get output that is mostly nonsense like:

'{"bbox_2d\n\n addCriterion\n addCriterion\n addCriterion\n addCriterion\n\n addCriterion\n addCriterion\n\n addCriterion\n\n\nGuidId\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n 自动生成\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\',

where the only thing that should be part of the answer is the bbox_2d part.
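A quick way to flag this kind of runaway output automatically is to measure how much of the response is whitespace. The sketch below is only a debugging aid: the helper names and the 0.8 threshold are hypothetical choices for illustration, not part of PEFT or TRL.

```python
def whitespace_ratio(text: str) -> float:
    """Fraction of characters in `text` that are whitespace (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def looks_degenerate(text: str, threshold: float = 0.8) -> bool:
    """Heuristic: a response that is mostly newlines/spaces is probably broken.

    The threshold is an arbitrary assumption; tune it for your own outputs.
    """
    return whitespace_ratio(text) >= threshold


# Shortened stand-in for the broken response shown above
broken = '{"bbox_2d\n\n addCriterion' + "\n" * 200
# Hypothetical well-formed answer, only for comparison
healthy = '{"bbox_2d": [10, 20, 30, 40]}'

print(looks_degenerate(broken))
print(looks_degenerate(healthy))
```

Running such a check over a handful of eval samples makes it easier to compare checkpoints or hyper-parameter settings without reading every response by hand.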

The model used for this has been previously lora-finetuned using the exact same training setup but with a lora-config as the "peft_config" parameter of the SFTTrainer, and has given perfectly fine responses before prompt tuning.

My question is: is prompt tuning implemented in a way that should work with VLMs like Qwen2.5VL, or is it only meant for pure LLMs? If it is implemented and should work in theory, I would be glad if anyone has an idea why the output gets messed up after prompt tuning.

Also, during prompt tuning, the eval_loss starts out at the same value where the lora-trained model left off and actually does decrease, even if only very slightly.

Thank you for any help you can provide!

Martin9797 avatar Nov 05 '25 16:11 Martin9797

Hi, I checked the notebook for possible problems. Some issues I saw:

  • It manually calls peft_model = get_peft_model(model, peft_config) but then passes the peft_config again to SFTTrainer, which leads to double wrapping. Remove the get_peft_model call and let trl handle it.
  • num_virtual_tokens=20 is rather small compared to the number of trainable parameters you have with LoRA, try a bigger value (100+).
  • Also, with prompt tuning, the learning rate can be higher, try 1e-3.
  • Instead of PromptTuningInit.RANDOM, try TEXT or SAMPLE_VOCAB. The latter worked best for me but requires PEFT to be installed from source.
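To put the point about `num_virtual_tokens` into numbers, here is a rough back-of-the-envelope sketch. The hidden size of 3584 is an assumption about the 7B language model (check `model.config` for the real value), and the LoRA figure covers a single square projection only, so the LoRA total across all targeted linear layers is far larger.

```python
# Rough trainable-parameter counts; the sizes below are assumptions for illustration.
HIDDEN_SIZE = 3584  # assumed hidden size of the language model; verify via model.config


def prompt_tuning_params(num_virtual_tokens: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Prompt tuning trains one embedding vector per virtual token."""
    return num_virtual_tokens * hidden_size


def lora_params_per_linear(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) to each targeted linear layer."""
    return r * (in_features + out_features)


print("prompt tuning, 20 tokens: ", prompt_tuning_params(20))
print("prompt tuning, 150 tokens:", prompt_tuning_params(150))
print("LoRA r=8, one projection: ", lora_params_per_linear(HIDDEN_SIZE, HIDDEN_SIZE, r=8))
```

Under these assumptions, 20 virtual tokens give roughly the capacity of one rank-8 LoRA adapter on a single projection, whereas `target_modules="all-linear"` attaches such an adapter to every linear layer.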

> Also, during prompt tuning, the eval_loss starts out at the same value where the lora-trained model left off and actually does decrease, even if only very slightly.

Just to be sure, do you create a completely new notebook for prompt learning? Don't try to continue learning in the same notebook session.

BenjaminBossan avatar Nov 07 '25 11:11 BenjaminBossan

Thank you for taking the time to respond. I have tried out your suggestions and still get the same results. Maybe it is helpful to mention that I get this warning along with my nonsense output:

peft_model.py:2141: UserWarning: Position ids are not supported for parameter efficient tuning. Ignoring position ids.
  warnings.warn("Position ids are not supported for parameter efficient tuning. Ignoring position ids.")

['{"kw\n addCriterion\n addCriterion\n addCriterion\n addCriterion\n\n addCriterion\n addCriterion\n\n addCriterion\n addCriterion\n\n\n addCriterion\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n']

I am using the code from the notebook, but copied it (with the necessary changes) into a local script. I run it with different parameters, first to get the LoRA-trained model (which works great); then I load that newly trained model and use it for prompt tuning, and at the end I try out inference to see if the answers make sense. Below is my script minus local paths:

import os
from datasets import load_dataset
import torch
from transformers import AutoTokenizer, AutoProcessor, Qwen2_5_VLConfig, Qwen2_5_VLForConditionalGeneration, \
    Qwen3VLMoeForConditionalGeneration, BitsAndBytesConfig, Qwen2VLForConditionalGeneration, Qwen3VLForConditionalGeneration
from peft import LoraConfig, get_peft_model, PromptTuningInit, PromptTuningConfig, TaskType, PeftModel, PrefixTuningConfig
from trl import SFTConfig, SFTTrainer
from qwen_vl_utils import process_vision_info
import trackio
import gc
import time

proj_root = "<myprojroot>/"
system_message = """You are a vision language model designed to find objects"""

def process_inputs(conversation):
    # Preparation for inference
    text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs, video_kwargs = process_vision_info(conversation, image_patch_size=16, return_video_kwargs=True)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        #fps=30,
        padding=True,
        return_tensors="pt",
        **video_kwargs,
    )
    return inputs.to("cuda")

def generate_response(inputs, model):
    # Inference: generate, strip the prompt tokens, and decode only the new tokens
    try:
        with torch.no_grad():
            generated_ids = model.generate(**inputs, max_new_tokens=1500)
        generated_ids_trimmed = [
            out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        return output_text
    finally:
        # Release cached GPU memory even if generation fails
        # (a `del` of the locals here would raise if generate() itself errored)
        torch.cuda.empty_cache()

def format_data(sample):
    return {
      "images": ["<myPathToImgs>" + sample["images"][0]],
      "messages": [
          {
              "role": "system",
              "content": [
                  {
                      "type": "text",
                      "text": system_message
                  }
              ],
          },
          {
              "role": "user",
              "content": [
                  {
                      "type": "image",
                      "image": "<myPathToImgs>" + sample["images"][0],
                  },
                  {
                      "type": "text",
                      "text": sample["conversations"][0]["value"],
                  }
              ],
          },
          {
              "role": "assistant",
              "content": [
                  {
                      "type": "text",
                      "text": sample["conversations"][1]["value"]
                  }
              ],
          },
      ]
      }

dataset_path = "<myPathToImgs>"

train_dataset_, eval_dataset_ = load_dataset(dataset_path, data_files="<myDataset>.json", split=['train[:90%]', 'train[-10%:]'])

train_dataset = [format_data(sample) for sample in train_dataset_]
eval_dataset = [format_data(sample) for sample in eval_dataset_]

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    #attn_implementation="flash_attention_2",
    device_map="auto")

processor = AutoProcessor.from_pretrained(model_name, use_fast=True)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# Initialize PromptTuning model
prompt_tune_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.SAMPLE_VOCAB,
    num_virtual_tokens=150,
    tokenizer_name_or_path=model_name
)

output_directory = os.path.join(proj_root, "qwen2-5-7b-instruct-lora")
output_directory_prompt_tune = os.path.join(proj_root, "qwen2-5-7b-instruct-prompt-tune")

# Configure training arguments
training_args = SFTConfig(
    output_dir=output_directory_prompt_tune,  # use for prompt tuning
    #output_dir=output_directory # use for lora
    num_train_epochs=3,  # Number of training epochs
    per_device_train_batch_size=2,  # Batch size for training
    per_device_eval_batch_size=2,  # Batch size for evaluation
    gradient_accumulation_steps=8,  # Steps to accumulate gradients
    gradient_checkpointing=False,
    #gradient_checkpointing_kwargs={"use_reentrant": False},  # Options for gradient checkpointing
    max_length=None,
    # Optimizer and scheduler settings
    optim="adamw_torch_fused",  # Optimizer type
    learning_rate=2e-4,  # Learning rate for training
    #learning_rate=1e-3, #learning rate for prompt tuning
    # Logging and evaluation
    logging_steps=10,  # Steps interval for logging
    eval_steps=10,  # Steps interval for evaluation
    eval_strategy="steps",  # Strategy for evaluation
    save_strategy="steps",  # Strategy for saving the model
    save_steps=20,  # Steps interval for saving
    # Mixed precision and gradient settings
    bf16=True,  # Use bfloat16 precision
    max_grad_norm=0.3,  # Maximum norm for gradient clipping
    warmup_ratio=0.03,  # Ratio of total steps for warmup
    push_to_hub=False,  # Whether to push model to Hugging Face Hub
)

#------- This is used for training either LoRA or prompt tuning, depending on peft_config parameter -----------
"""
trained_model = PeftModel.from_pretrained(model, os.path.join(output_directory, "checkpoint-153"), device_map="auto", is_trainable=False)


trainer = SFTTrainer(
    model=trained_model,   #use for prompt tuning with a trained lora checkpoint
    #model=model  #use for lora
    args=training_args,
    train_dataset=train_dataset,
    #peft_config=peft_config, #use for lora
    peft_config=prompt_tune_config,  #use for prompt tuning
    eval_dataset=eval_dataset,
    processing_class=processor,
)

trainable_params=trainer.get_num_trainable_parameters()
print(f"Trainable parameters: {trainable_params}")
print(trainer.model)

trainer.train()

trainer.save_model(training_args.output_dir)
"""

#-------------------------- This is used for checking an example -----------------------------------------
#"""
loaded_model = PeftModel.from_pretrained(model, os.path.join(training_args.output_dir, "checkpoint-153"), device_map="auto", is_trainable=False)


print(eval_dataset[15])
print("\n\n")
inputs = process_inputs(eval_dataset[15]["messages"])
response = generate_response(inputs, loaded_model)
print(response)
#"""

Martin9797 avatar Nov 07 '25 13:11 Martin9797

Do I understand correctly that your goal is:

  1. First train LoRA on the dataset
  2. Save the LoRA model
  3. Load the LoRA model
  4. Train prompt tuning on top of the LoRA model

What is the idea there? I don't really see a reason to do that. If you really need to do it like that, could you try adding this in the second stage:

trained_model = PeftModel.from_pretrained(model, os.path.join(output_directory, "checkpoint-153"), device_map="auto", is_trainable=False)
+ trained_model = trained_model.merge_and_unload()

Checking the Qwen2_5_VLModel class, from what I can tell, prompt-tuning will only affect the language model part, the vision model part will not be affected by prompt-tuning and hence will not improve.
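For intuition on what merging does, here is a toy sketch in plain Python. It is not PEFT's implementation, and the small matrices and the `alpha`/`r` values are made up; it only shows that folding the low-rank update into the base weight gives the same output as applying base weight and adapter separately.

```python
# Toy illustration of merging a LoRA update into a base weight (not PEFT's code).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (out x in)
A = [[0.5, -0.5]]             # LoRA A matrix (r x in), rank r = 1
B = [[1.0], [2.0]]            # LoRA B matrix (out x r)
alpha, r = 2.0, 1
scaling = alpha / r

x = [[1.0], [3.0]]            # input as a column vector

# Unmerged: base path plus the scaled low-rank path
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged: fold the update into the weight once, then use a single matmul
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print(unmerged, merged)
```

After merging, the model is a plain base model again, so a second adapter such as prompt tuning is stacked on ordinary weights rather than on another PEFT wrapper.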

BenjaminBossan avatar Nov 07 '25 14:11 BenjaminBossan

You understand that correctly as the purpose of the script right now.

The original goal was prompt tuning on the original Qwen2.5VL model only. But since that already produced these weird outputs with "addCriterion\n" being spammed, I wanted to see if it would still happen on a properly LoRA-trained model (to show that prompt tuning really makes the output worse).

So just skipping the LoRA stage also produces these same artefacts in the output.

As for my reasons, I know that just lora-finetuning the language part of the qwen2.5VL model shows better results for my use case. So now I wanted to see if just Prompt tuning will also improve my results and how close it can come to a lora-finetune. You could say the LoRA in between is just for debugging, the actual issue is the output being broken.

Martin9797 avatar Nov 07 '25 15:11 Martin9797

I see, thanks for explaining further. I don't think that prompt-tuning is fundamentally broken with Qwen 2.5 VL. When I changed your script to use the original dataset from the notebook and trained with a smaller model, Qwen/Qwen2.5-VL-3B-Instruct (due to VRAM), it looked like it learned properly and didn't produce garbage. With your data, it could be different.

You could spend some time checking different hyper-parameters (prompt tuning init, number of virtual tokens, learning rate, optimizer, ...) but in your place, I would probably stick with LoRA. LoRA generally has higher capacity to learn, so if your problem is hard, it is the better choice. Also, LoRA can be applied to the vision part, whereas prompt-tuning cannot improve that, making it even more limited. Finally, LoRA can be merged, eliminating any overhead at inference time. So it looks like the better fit for you.

BenjaminBossan avatar Nov 07 '25 15:11 BenjaminBossan

So you used the HuggingFaceM4/ChartQA Dataset, a smaller model and otherwise the same script I posted and no issues with the output? Then that narrows it down to either my dataset being faulty (although it works with LoRA) or the model behaving differently with a smaller size.

I am aware that LoRA is likely to perform a lot better, I just wanted to see how far prompt tuning can get me.

I'll try and recreate your working setup and see if I can make my dataset work from there.

Thanks a lot for the support!

Martin9797 avatar Nov 07 '25 15:11 Martin9797

For completeness, here is what I used:

import os
from datasets import load_dataset
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, PromptTuningInit, PromptTuningConfig, TaskType, PeftModel
from trl import SFTConfig, SFTTrainer
from qwen_vl_utils import process_vision_info

proj_root = "/tmp/peft/2899"
system_message = """You are a vision language model designed to find objects"""

def process_inputs(conversation):
    # Preparation for inference
    text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs, video_kwargs = process_vision_info(conversation, image_patch_size=16, return_video_kwargs=True)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        #fps=30,
        padding=True,
        return_tensors="pt",
        **video_kwargs,
    )
    return inputs.to("cuda")

def generate_response(inputs, model):
    # Inference: generate, strip the prompt tokens, and decode only the new tokens
    try:
        with torch.no_grad():
            generated_ids = model.generate(**inputs, max_new_tokens=1500)
        generated_ids_trimmed = [
            out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        return output_text
    finally:
        # Release cached GPU memory even if generation fails
        # (a `del` of the locals here would raise if generate() itself errored)
        torch.cuda.empty_cache()


def format_data(sample):
    return {
      "images": [sample["image"]],
      "messages": [

          {
              "role": "system",
              "content": [
                  {
                      "type": "text",
                      "text": system_message
                  }
              ],
          },
          {
              "role": "user",
              "content": [
                  {
                      "type": "image",
                      "image": sample["image"],
                  },
                  {
                      "type": "text",
                      "text": sample['query'],
                  }
              ],
          },
          {
              "role": "assistant",
              "content": [
                  {
                      "type": "text",
                      "text": sample["label"][0]
                  }
              ],
          },
        ]
    }

dataset_id = "HuggingFaceM4/ChartQA"
train_dataset, eval_dataset, _ = load_dataset(dataset_id, split=['train[:10%]', 'val[:10%]', 'test[:10%]'])
train_dataset = [format_data(sample) for sample in train_dataset]
eval_dataset = [format_data(sample) for sample in eval_dataset]

#model_name = "Qwen/Qwen2.5-VL-7B-Instruct"
model_name = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    #attn_implementation="flash_attention_2",
    device_map="auto")

processor = AutoProcessor.from_pretrained(model_name, use_fast=True)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# Initialize PromptTuning model
prompt_tune_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.SAMPLE_VOCAB,
    num_virtual_tokens=150,
    tokenizer_name_or_path=model_name
)

output_directory = os.path.join(proj_root, f"{model_name}-lora")
output_directory_prompt_tune = os.path.join(proj_root, f"{model_name}-instruct-prompt-tune")

# Configure training arguments
training_args = SFTConfig(
    output_dir=output_directory_prompt_tune,  # use for prompt tuning
    #output_dir=output_directory # use for lora
    num_train_epochs=3,  # Number of training epochs
    per_device_train_batch_size=2,  # Batch size for training
    per_device_eval_batch_size=2,  # Batch size for evaluation
    gradient_accumulation_steps=8,  # Steps to accumulate gradients
    gradient_checkpointing=False,
    #gradient_checkpointing_kwargs={"use_reentrant": False},  # Options for gradient checkpointing
    max_length=2048,
    # Optimizer and scheduler settings
    optim="adamw_torch_fused",  # Optimizer type
    learning_rate=1e-3,  # Learning rate for training
    #learning_rate=1e-3, #learning rate for prompt tuning
    # Logging and evaluation
    logging_steps=10,  # Steps interval for logging
    eval_steps=10,  # Steps interval for evaluation
    eval_strategy="steps",  # Strategy for evaluation
    save_strategy="steps",  # Strategy for saving the model
    save_steps=20,  # Steps interval for saving
    # Mixed precision and gradient settings
    bf16=True,  # Use bfloat16 precision
    max_grad_norm=0.3,  # Maximum norm for gradient clipping
    warmup_ratio=0.03,  # Ratio of total steps for warmup
    push_to_hub=False,  # Whether to push model to Hugging Face Hub
)

#------- This is used for training either LoRA or prompt tuning, depending on peft_config parameter -----------
# trained_model = PeftModel.from_pretrained(model, os.path.join(output_directory, "checkpoint-153"), device_map="auto", is_trainable=False)


trainer = SFTTrainer(
    #model=trained_model,   #use for prompt tuning with a trained lora checkpoint
    model=model,  #use for lora
    args=training_args,
    train_dataset=train_dataset,
    #peft_config=peft_config, #use for lora
    peft_config=prompt_tune_config,  #use for prompt tuning
    eval_dataset=eval_dataset,
    processing_class=processor,
)

trainable_params=trainer.get_num_trainable_parameters()
print(f"Trainable parameters: {trainable_params}")
print(trainer.model)

trainer.train()

trainer.save_model(training_args.output_dir)

loaded_model = PeftModel.from_pretrained(model, os.path.join(training_args.output_dir, "checkpoint-153"), device_map="auto", is_trainable=False)


print(eval_dataset[15])
print("\n\n")
inputs = process_inputs(eval_dataset[15]["messages"])
response = generate_response(inputs, loaded_model)
print(response)

It OOMs for me after a few steps (probably due to very long input sequences), so I never finished training, but intermediate results look reasonable.

I think the most likely explanation for your situation is that prompt-tuning has difficulties learning on your dataset because it's not powerful enough. Some hyper-parameter tweaks could fix that, but I would recommend sticking with LoRA. If the LoRA results are not good enough, I would tweak LoRA to get better results (e.g. higher rank) rather than adding a second training stage with prompt-tuning, for the reasons given in my last reply.

BenjaminBossan avatar Nov 07 '25 16:11 BenjaminBossan

To clarify: by results look reasonable, do you mean the actual output that you got with this script using

inputs = process_inputs(eval_dataset[15]["messages"][1:2]) #addition from prev script: the [1:2]
response = generate_response(inputs, loaded_model)
print(response)

this response looks reasonable? I tried to recreate this, and with the ChartQA dataset and the 3B model, the response after training for 160 steps is still quite strange:

/lib/python3.10/site-packages/peft/peft_model.py:2141: UserWarning: Position ids are not supported for parameter efficient tuning. Ignoring position ids.
  warnings.warn("Position ids are not supported for parameter efficient tuning. Ignoring position ids.")
["1:'<tool_call>\n<tool_call>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nn\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n

and then the latter part repeats for roughly as many tokens as the max_new_tokens value I set in

generated_ids = model.generate(**inputs, max_new_tokens=1500)

Could you maybe tell me at what training step you evaluated and how you got your model response?

Martin9797 avatar Nov 11 '25 10:11 Martin9797

Using my script above, I set max_steps=10 to avoid OOM, then evaluated the model like so:

def process_inputs(conversation):
    # Preparation for inference
    text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs, video_kwargs = process_vision_info(conversation, image_patch_size=16, return_video_kwargs=True)
    video_kwargs["fps"] = 30
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
        **video_kwargs,
    )
    return inputs.to("cuda")

inputs = process_inputs(eval_dataset[15]["messages"])
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs['input_ids'],
        pixel_values=inputs['pixel_values'],
        image_grid_thw=inputs['image_grid_thw'],
        max_new_tokens=500,
    )
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

This prints:

system
You are a vision language model designed to find objects
user
What's the total sum of peak points of green and red lines?
assistant
87
assistant
The total sum of the peak points of the green (mostly good news) and red (mostly bad news) lines is 19 + 80 = 99.

If I remove skip_special_tokens=True, there are a bunch of padding tokens:

"<|im_start|>system\nYou are a vision language model designed to find objects<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|image_pad|>
[...]
<|image_pad|><|image_pad|><|vision_end|>What's the total sum of peak points of green and red lines?<|im_end|>\n<|im_start|>assistant\n87<|im_end|>\n<|im_start|>assistant\nThe total sum of the peak points of the green (mostly good news) and red (mostly bad news) lines is 19 + 80 = 99.<|im_end|>

Not sure if that's how it's supposed to be or not.
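The system and user turns show up in the printout because the snippet decodes `generated_ids` as a whole; for decoder-only models, `generate` returns the prompt tokens followed by the new ones. A minimal sketch of the trimming step (the same slicing as in `generate_response` earlier in the thread), using made-up token ids in place of real tensors:

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence, keeping only new tokens."""
    return [out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)]


# Made-up ids: four prompt tokens followed by two newly generated tokens
input_ids = [[101, 7, 8, 9]]
generated_ids = [[101, 7, 8, 9, 42, 43]]

print(trim_prompt(input_ids, generated_ids))
```

Decoding only the trimmed ids leaves just the model's answer. Note that the "87" in the printout is the reference answer: the conversation passed to `process_inputs` still contains the assistant turn.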

BenjaminBossan avatar Nov 11 '25 13:11 BenjaminBossan

And the model in generated_ids = model.generate( is being loaded in with model = PeftModel.from_pretrained(model, os.path.join(training_args.output_dir, "checkpoint-10"), device_map="auto", is_trainable=False)?

When I do this, which should now be exactly your setup:

loaded_model = PeftModel.from_pretrained(model, os.path.join(training_args.output_dir, "checkpoint-10"), device_map="auto", is_trainable=False)


def process_inputs(conversation):
    # Preparation for inference
    text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs, video_kwargs = process_vision_info(conversation, image_patch_size=16, return_video_kwargs=True)
    video_kwargs["fps"] = 30
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
        **video_kwargs,
    )
    return inputs.to("cuda")

inputs = process_inputs(eval_dataset[15]["messages"])
with torch.no_grad():
    generated_ids = loaded_model.generate(
        input_ids=inputs['input_ids'],
        pixel_values=inputs['pixel_values'],
        image_grid_thw=inputs['image_grid_thw'],
        max_new_tokens=500,
    )
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

I get the printout:


system
You are a vision language model designed to spot anomalies and graphical errors in computer generated images and 3D renderings
user
What's the total sum of peak points of green and red lines?
assistant
87
assistant
The
(1)

Martin9797 avatar Nov 11 '25 15:11 Martin9797

Ah, I had a mistake in my code, I used model.generate but it should have been trainer.model.generate, as model is just the base model without prompt tuning. If I use trainer.model.generate I do get gibberish like you do (saving and loading the model makes no difference).

As mentioned earlier, I still think that prompt tuning might not be a good fit here, as it isn't as powerful as LoRA and the dataset might be too hard for it to learn. Also note that LoRA, by default, is initialized as an identity transform, i.e. it starts out generating the same output as the base model and then gradually learns to improve it. With prompt tuning, this is not the case, so it might take a while to train the model to even get to the same level as the base model, much less improve upon it.
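The difference in starting points can be sketched with a toy model in plain Python. Everything here is made up for illustration (a mean-pool plus linear readout stands in for the real network): with LoRA's B initialized to zero the effective weight is unchanged, while prepending a virtual token changes the input from the very first step.

```python
def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def toy_model(seq, w):
    """Mean-pool a sequence of embeddings, then apply a linear readout."""
    pooled = [sum(tok[i] for tok in seq) / len(seq) for i in range(len(w))]
    return dot(pooled, w)


w = [1.0, -1.0]                   # frozen "base model" weight
seq = [[2.0, 0.0], [4.0, 2.0]]    # embeddings of the real input tokens
base_out = toy_model(seq, w)

# LoRA-style update: w_eff = w + B * A, where B starts at zero by default,
# so the update contributes nothing before training.
A = [0.3, 0.7]
B = 0.0
w_lora = [wi + B * ai for wi, ai in zip(w, A)]
lora_out = toy_model(seq, w_lora)

# Prompt-tuning-style update: a (here arbitrary) virtual token is prepended,
# so the model sees a different input sequence right away.
virtual_tokens = [[-1.0, 5.0]]
prompt_out = toy_model(virtual_tokens + seq, w)

print(base_out, lora_out, prompt_out)
```

In this toy setup the LoRA output equals the base output exactly, while the prompt-tuned output differs before any training has happened.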

I did try to find some better hyper-parameters, but I can't train the model well because of the sequence length issue (setting max_length to a lower value results in an error for this model). It's still possible that with the correct hyper-parameters, prompt-tuning could be made to work, but it just doesn't seem like a worthwhile effort to me.

BenjaminBossan avatar Nov 11 '25 16:11 BenjaminBossan

I may have found the issue: If you set use_cache=False the generation is much more in line with what I would expect and there is no longer gibberish output. This is the code section to change:

inputs = process_inputs(eval_dataset[15]["messages"][0:2]) # using [0:2] should prevent ground truth leaking
with torch.no_grad():
    generated_ids = loaded_model.generate(
        input_ids=inputs['input_ids'],
        pixel_values=inputs['pixel_values'],
        image_grid_thw=inputs['image_grid_thw'],
        max_new_tokens=500,
        use_cache=False, # This seems to be necessary when using prompt tuned models
    )
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])

This was not necessary for LoRA tuned models, but seems to be necessary for Prompt tuned models.
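The two changes can be wrapped in small helpers so they are harder to forget. This is a sketch only: `prompt_only` and `generation_kwargs` are hypothetical names, and the kwargs dict is meant to be unpacked into `loaded_model.generate(...)` alongside the processor outputs.

```python
def prompt_only(messages):
    """Keep the turns before the first assistant message.

    This avoids feeding the reference answer to the model (same effect as
    slicing with [0:2] for a system/user/assistant conversation).
    """
    kept = []
    for message in messages:
        if message["role"] == "assistant":
            break
        kept.append(message)
    return kept


def generation_kwargs(max_new_tokens=500, prompt_tuned=True):
    """Build keyword arguments for generate(); disables the KV cache for prompt-tuned models."""
    kwargs = {"max_new_tokens": max_new_tokens}
    if prompt_tuned:
        # Per the observation above, caching produced gibberish with prompt tuning
        kwargs["use_cache"] = False
    return kwargs


conversation = [
    {"role": "system", "content": "You are a vision language model designed to find objects"},
    {"role": "user", "content": "What's the total sum of peak points of green and red lines?"},
    {"role": "assistant", "content": "87"},
]

print([m["role"] for m in prompt_only(conversation)])
print(generation_kwargs())
```

Disabling the cache recomputes attention over the full sequence at every step, so generation will be slower; it is worth re-enabling it for LoRA or merged models where it was not an issue.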

Martin9797 avatar Nov 19 '25 21:11 Martin9797

Nice catch, thanks for reporting. Indeed, for training there is no need for use_cache. IIUC, it will be disabled automatically when training with transformers starting with v5.

BenjaminBossan avatar Nov 20 '25 10:11 BenjaminBossan