Fix continue_final_message for image-text-to-text chat templates
What does this PR do?
The `content` field of a chat message for an image-text-to-text model is a list of typed dicts rather than a plain string, which `tokenization_utils_base` does not currently account for when `continue_final_message` is set to `True`.
Split from the image-text-to-text PR.
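For context, the failure comes from the step that trims the rendered prompt back to the final message, which assumes `chat[-1]["content"]` is a string. A minimal sketch of the list-aware handling needed (helper names here are illustrative, not the actual diff):

```python
def _final_message_text(content):
    # "content" is either a plain string or, for image-text-to-text chats,
    # a list of typed dicts; take the trailing "text" block in that case.
    if isinstance(content, (list, tuple)):
        content = [block["text"] for block in content if block.get("type") == "text"][-1]
    return content.strip()


def trim_to_final_message(rendered_chat, messages):
    # Cut the rendered prompt right after the final message's text so that
    # generation continues it instead of opening a new assistant turn.
    final_text = _final_message_text(messages[-1]["content"])
    return rendered_chat[: rendered_chat.rindex(final_text) + len(final_text)]
```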
To reproduce the error:
```python
from transformers import LlavaProcessor, LlavaForConditionalGeneration
import torch
from PIL import Image
import requests

processor = LlavaProcessor.from_pretrained("llava-hf/llava-interleave-qwen-0.5b-hf")
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-interleave-qwen-0.5b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True
)
model.to("cuda:0")

# Define a chat history and use `apply_chat_template` to get a correctly formatted prompt.
# Each value in "content" has to be a list of dicts with types ("text", "image").
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "There is a dog and"},
        ],
    },
]
prompt = processor.apply_chat_template(messages, continue_final_message=True)

# Load the image referenced in the chat and pass it to the processor alongside the prompt
image = Image.open(
    requests.get("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg", stream=True).raw
)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda:0").to(torch.float16)

# Autoregressively complete the prompt
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```
@zucchini-nlp @ArthurZucker
Added one test for the llava processor :). I could add one for every VLM processor that uses a chat template, but since they all rely on the same underlying `apply_chat_template`, I didn't think it was worth the extra diffs. Wdyt?
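For reference, a minimal sketch of what such a test can assert (the test name is illustrative, not the one added in the PR): with `continue_final_message=True`, the rendered prompt should end with the assistant's partial text rather than an end-of-turn token.

```python
from transformers import LlavaProcessor


def test_continue_final_message_with_list_content():
    processor = LlavaProcessor.from_pretrained("llava-hf/llava-interleave-qwen-0.5b-hf")
    messages = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]},
        {"role": "assistant", "content": [{"type": "text", "text": "There is a dog and"}]},
    ]
    prompt = processor.apply_chat_template(messages, continue_final_message=True)
    # The prompt must stop mid-message so generation picks up where it left off
    assert prompt.endswith("There is a dog and")
```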