
Better llava next.

Open nxphi47 opened this issue 1 year ago • 39 comments

  • Batched forward with multiple images of different sizes (different numbers of patches).
  • Support training for cases without any image.
  • Support multi-image in the same sequence, e.g.: ["<image> <image> the first image is a dog while the second is a cat", "<image> <image> <image> <image> these 4 images are..."]
  • Support batched generation (requires padding_side = "left")

device = "cuda:0"
model_path = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_path)
processor.tokenizer.padding_side = "left"

model = LlavaNextForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, device_map="cuda",
)


# ! Different images, same prompt

cat_img = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
chart_img = Image.open(requests.get("https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true", stream=True).raw)

prompts = [
    "[INST] <image>\nWhat is shown in this image? [/INST]",
    "[INST] <image>\nWhat is shown in this image? [/INST]",
]
inputs = processor(prompts, [chart_img, cat_img], return_tensors='pt', padding=True).to("cuda")
processor.tokenizer.padding_side = "left"

# just in case, pass padding_side = "left" to generate
output = model.generate(**inputs, max_new_tokens=1024, do_sample=False, pad_token_id=processor.tokenizer.pad_token_id, padding_side=processor.tokenizer.padding_side)

for o in output:
    print(processor.decode(o, skip_special_tokens=True))

# expected output
"""
[INST]  
What is shown in this image? [/INST] The image appears to be a radar chart, which is a type of multivariate chart that displays values for multiple variables represented on axes starting from the same point. This particular radar chart is showing the performance of different models or systems across various metrics.

The axes represent different metrics or benchmarks, such as MM-Vet, MM-Vet-GQA, MM-Vet-GQA-VizWiz, LLaVa-Bench, SLED-Bench, and several others. Each axis is labeled with the name of the metric and a numerical value, which likely represents a score or a performance measure.

The colored areas within the chart represent different models or systems, such as MME, BLIP-2, InstructionBLIP, and others. The size of the area on each axis indicates the performance of the model or system on that particular metric.

The chart is color-coded to differentiate between the different models or systems, and it provides a visual comparison of their performance across the various metrics. This kind of chart is often used in machine learning and artificial intelligence to compare the performance of different models or algorithms. 
[INST]  
What is shown in this image? [/INST] The image shows two cats lying on a pink blanket. The cat on the left is curled up in a relaxed position, while the cat on the right is stretched out with its head resting on the blanket. There is a remote control next to the cat on the left, suggesting that this scene might be taking place in a living room or a similar space where people might watch television. The cats appear to be sleeping or resting. 
"""

What does this PR do?

Fixes # (issue)

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

nxphi47 avatar Mar 25 '24 11:03 nxphi47

Thanks for your PR, feel free to add the changes to modeling_llava_next.py so that we can see an easier diff

NielsRogge avatar Mar 25 '24 14:03 NielsRogge

Just tried out your implementation from a new branch I made.

It gives me the following outputs (by running python src/transformers/models/llava_next/test.py) : """ ['[INST] [INST] \nWhat is shown in this image? [/INST] [/INST] The image appears to be a radar chart, which is a type of multi-dimensional plot that displays data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point.\n\nIn this particular radar chart, there are several axes labeled with different metrics or benchmarks, such as "MM-Vet," "MM-Bench," "LLa-Va-Bench," "LLa-Va-B', '[INST] [INST] \nHow many cats are there? [/INST] [/INST] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 cats are two cats are there are there are two cats are two cats are two cats are two cats are two cats are two cats are two cats are two cats are two cats are two cats are two cats are two cats are two cats are two cats are two cats are two cats are two cats are two cats'] """

Whereas the original implementation gives me the following on the same inputs (by running CUDA_VISIBLE_DEVICES=0 python llava/eval/run_llava_batched_inference.py --model-path "liuhaotian/llava-v1.6-mistral-7b" --image-file "images/llava_v1_5_radar.jpg" --query "What is shown in this image?" from this branch):

""" ['The image appears to be a radar chart, which is a type of multi-dimensional plot that displays data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point.\n\nIn this particular radar chart, there are several axes labeled with different metrics or benchmarks, such as "MM-Vet," "MM-Bench," "LLa-Va-Bench," "LLa-Va-B', 'There are two cats in the image.'] """ Note that the inputs use right padding, I pushed them here: https://huggingface.co/datasets/nielsr/llava-batched-inference/tree/main

So it looks like the first one in the batch is correct, whereas the second one (regarding the cats image) goes off the rails.

I've also noticed that the padding token ID is currently set incorrectly for the tokenizer of https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf: it should be <unk> instead of <pad>.
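
For reference, a quick way to inspect it and work around it locally (the proper fix would be updating the tokenizer config on the Hub):

from transformers import LlavaNextProcessor

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
# check what the pad token and its id currently are
print(processor.tokenizer.pad_token, processor.tokenizer.pad_token_id)

# local workaround: reuse <unk> as the padding token until the Hub tokenizer is updated
processor.tokenizer.pad_token = processor.tokenizer.unk_token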

NielsRogge avatar Mar 25 '24 21:03 NielsRogge

@NielsRogge Thanks. Let me check it out. I thought batched generation requires left padding, unless the samples have exactly the same number of tokens, because otherwise pad tokens will end up in the middle of the generation?
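
To illustrate what I mean with a tokenizer-only toy example (no model involved, just two made-up prompts):

# Toy illustration of the padding-side concern: with right padding, the pad positions of
# the shorter prompt sit exactly where its newly generated tokens would have to go.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
prompts = ["What is shown in this image?", "How many cats?"]

tok.padding_side = "right"
print(tok(prompts, padding=True).attention_mask)  # zeros at the end of the shorter row

tok.padding_side = "left"
print(tok(prompts, padding=True).attention_mask)  # zeros at the front, generation continues cleanly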

nxphi47 avatar Mar 26 '24 01:03 nxphi47

@NielsRogge I thought the pad token id being wrong was fixed as mentioned by you in https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf/discussions/2

aliencaocao avatar Mar 26 '24 04:03 aliencaocao

@NielsRogge I have added batched generation with left padding in the latest commit. Try it here:

import torch
from huggingface_hub import hf_hub_download
import requests
from PIL import Image

from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
from transformers.models.llava_next.modeling_better_llava_next import BetterLlavaNextForConditionalGeneration
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
processor.tokenizer.padding_side = "left"

model = BetterLlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, device_map="cuda",
)

# ! Chart and cat
cat_img = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
chart_img = Image.open(requests.get("https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true", stream=True).raw)


prompts = [
    "[INST] <image>\nWhat is shown in this image? [/INST]",
    "[INST] <image>\nWhat is shown in this image? [/INST]"
]
inputs = processor(prompts, [chart_img, cat_img], return_tensors='pt', padding=True).to("cuda")

output = model.generate(**inputs, max_new_tokens=20, do_sample=False, pad_token_id=processor.tokenizer.pad_token_id)

for o in output:
    print(processor.decode(o, skip_special_tokens=True))

"""output
[INST]  
What is shown in this image? [/INST] The image appears to be a radar chart, which is a type of multivariate chart that
[INST]  
What is shown in this image? [/INST] 2 cats lying on a pink blanket. The cat on the left is stretching and has its
"""

However, I still find that batch=1 and batch=2 produce inconsistent outputs, so I need to debug further.

nxphi47 avatar Mar 26 '24 07:03 nxphi47

We get repetition degeneration when one item in the batch has finished generating.

# batch=2
[INST]  
What is shown in this image? [/INST] 2 cats lying on a pink blanket. The cat on the left is stretching and has its paws extended, while the cat on the right is curled up and appears to be sleeping. There is a remote control next to the cats, and a red couch is visible in the background. 
[INST]  
What is shown in this image? [/INST] The image appears to be a radar chart, which is a type of multivariate chart that displays values for multiple variables represented on axes starting from the same point. This particular radar chart is showing the performance of different models or systems across various metrics.

The axes represent different metrics or benchmarks, such as MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-Vet, MM-V

# batch=1

[INST]  
What is shown in this image? [/INST] The image appears to be a radar chart, which is a type of multivariate chart that displays values for multiple variables represented on axes starting from the same point. This particular radar chart is showing the performance of different models or systems across various metrics.

The axes represent different metrics or benchmarks, such as MM-Vet, MM-Vet-GQA, MM-Vet-GQA-VizWiz, LLaVa-Bench, SLED-Bench, and several others. Each axis is labeled with the name of the metric and a numerical value, which likely represents a score or a performance measure.

The colored areas within the chart represent different models or systems, such as MME, BLIP-2, InstructionBLIP, and others. The size of the area and the position of the models on the axes indicate their performance on each metric.

The chart is color-coded to differentiate between the different models or systems, and it provides a visual comparison of their performance across the various benchmarks. This kind of chart is often used in machine learning and artificial intelligence to compare the performance of different models or algorithms. 

So I suspect it has something to do with this part: https://github.com/huggingface/transformers/blob/b98581b09521b47fc597d2077819895ab4c059dd/src/transformers/models/llava_next/modeling_better_llava_next.py#L950

nxphi47 avatar Mar 26 '24 07:03 nxphi47

Could you try without cached keys/values (by setting model.config.use_cache=False)?

I'm not able to reproduce the output of your code snippet; I get:

"""
['[INST] \nWhat is shown in this image? [/INST] The image appears to be a radar chart, which is a type of multivariate chart that displays values for multiple variables represented on axes starting from the same point. This particular radar chart is showing the performance of different models or systems across various', '[INST] \nWhat is shown in this image? [/INST] ']
"""

My branch is here. I'm running the script python src/transformers/models/llava_next/test_bis.py.

NielsRogge avatar Mar 26 '24 07:03 NielsRogge

@NielsRogge OK, I also get your output when using this PR alone. I found that this PR and my internal code (which I ported the PR from) are the same regarding the llava_next code, but differ in modeling_mistral.py.

This is SdpaAttention. Left is my code, right is the latest transformers code.

(Screenshot: SdpaAttention comparison, 2024-03-26 4:52 PM)

My internal code is older and, I guess, based on earlier commits. Do you think the problem could arise from this Mistral difference?

nxphi47 avatar Mar 26 '24 09:03 nxphi47

I got the same results both with and without model.config.use_cache=False.

nxphi47 avatar Mar 26 '24 09:03 nxphi47

@aliencaocao I discussed this offline with @ArthurZucker. The padding token was added for llava 1.5 because, at the time, the original repository didn't support batched generation and didn't seem to use a padding token. Having a clear separation between an unk and a padding token was the reason it was added, and the same was done for llava 1.6.

However, I'm now able to perform batched inference, and it uses <unk> as the padding token with right padding, which is why I think we may need to update the tokenizer.

Edit: I don't think right padding makes sense for batched generation; one should use left padding. I assume it's fine to use <pad> instead of <unk> for batched generation (we do the same for llava 1.5).

NielsRogge avatar Mar 26 '24 11:03 NielsRogge

@nxphi47 the latest transformers Mistral implementation should be exact; is this a sliding_window issue?

ArthurZucker avatar Mar 26 '24 12:03 ArthurZucker

@NielsRogge @ArthurZucker I figured it out and have pushed the updated code. Please give it a test! Basically, what went wrong:

  • batched generation always requires padding_side = "left"
  • the line attention_mask = torch.cat((attention_mask, extended_attention_mask), dim=1) should be attention_mask = torch.cat((extended_attention_mask, attention_mask), dim=1) for padding_side = "left". Otherwise you will get a mask like [1,1,1,0,0,0,0,1,1,1,1]
  • in _merge_input_ids_with_image_features, sometimes you cannot tell from the inputs whether this is left or right padding.
    • e.g. 1 (same image, different prompts): [f"[INST] <image> What are the things I should be cautious about when I visit here? [/INST]", f"[INST] <image> Describe what you see. [/INST]",]. Here you can tell padding_side from input_ids.
    • e.g. 2 (different images, i.e. different numbers of visual tokens, same prompt): ["[INST] <image>\nWhat is shown in this image? [/INST]", "[INST] <image>\nWhat is shown in this image? [/INST]",]. Here the input attention_mask is all 1s, but final_attention_mask has padding because image 1 and image 2 have different numbers of visual tokens. In this case, we have to pass another arg padding_side to tell the model we are doing left padding, so that it places the padding correctly.

Check out the test script: tests/models/llava_next/test_llava_next_batched_gen.py

Example run:

device = "cuda:0"
model_path = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_path)
processor.tokenizer.padding_side = "left"

model = LlavaNextForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, device_map="cuda",
)


# ! Different images, same prompt

cat_img = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
chart_img = Image.open(requests.get("https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true", stream=True).raw)

prompts = [
    "[INST] <image>\nWhat is shown in this image? [/INST]",
    "[INST] <image>\nWhat is shown in this image? [/INST]",
]
inputs = processor(prompts, [chart_img, cat_img], return_tensors='pt', padding=True).to("cuda")
processor.tokenizer.padding_side = "left"
output = model.generate(**inputs, max_new_tokens=1024, do_sample=False, pad_token_id=processor.tokenizer.pad_token_id, padding_side=processor.tokenizer.padding_side)

for o in output:
    print(processor.decode(o, skip_special_tokens=True))

# expected output
"""
[INST]  
What is shown in this image? [/INST] The image appears to be a radar chart, which is a type of multivariate chart that displays values for multiple variables represented on axes starting from the same point. This particular radar chart is showing the performance of different models or systems across various metrics.

The axes represent different metrics or benchmarks, such as MM-Vet, MM-Vet-GQA, MM-Vet-GQA-VizWiz, LLaVa-Bench, SLED-Bench, and several others. Each axis is labeled with the name of the metric and a numerical value, which likely represents a score or a performance measure.

The colored areas within the chart represent different models or systems, such as MME, BLIP-2, InstructionBLIP, and others. The size of the area on each axis indicates the performance of the model or system on that particular metric.

The chart is color-coded to differentiate between the different models or systems, and it provides a visual comparison of their performance across the various metrics. This kind of chart is often used in machine learning and artificial intelligence to compare the performance of different models or algorithms. 
[INST]  
What is shown in this image? [/INST] The image shows two cats lying on a pink blanket. The cat on the left is curled up in a relaxed position, while the cat on the right is stretched out with its head resting on the blanket. There is a remote control next to the cat on the left, suggesting that this scene might be taking place in a living room or a similar space where people might watch television. The cats appear to be sleeping or resting. 
"""

I have also removed the BetterLlavaNext class and integrated it into LlavaNext.

nxphi47 avatar Mar 27 '24 12:03 nxphi47

So this would close https://github.com/huggingface/transformers/issues/29832, right? Any idea whether this change could also solve https://github.com/huggingface/transformers/issues/29835?

aliencaocao avatar Mar 27 '24 17:03 aliencaocao

Awesome @nxphi47!! Will take a look tomorrow. Yes @aliencaocao, this PR could fix the batched generation issue. The other issue #29835 seems to happen with both llava and llava-next, so that's a different one.

NielsRogge avatar Mar 27 '24 17:03 NielsRogge

Thanks, I'm able to reproduce it, so it works fine! Great work. Final to-dos:

  • [ ] address comments I made above
  • [ ] make sure everything is backwards compatible (by running RUN_SLOW=yes pytest tests/models/llava_next)
  • [ ] make sure CI is green (can be done by running make fixup locally and address reported issues)

When that's done, I'll assign @ArthurZucker for final review

NielsRogge avatar Mar 30 '24 10:03 NielsRogge

@NielsRogge There you go, kindly review the updated code.

nxphi47 avatar Apr 01 '24 02:04 nxphi47

The current implementation of the processor no longer supports multi-image in the same sequence. I would suggest we support that by adding another num_images dimension to pixel_values (batch_size, num_images, num_patches, 3, 336, 336) and image_sizes (batch_size, num_images, 2), with zero padding.
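
A rough sketch of the layout I have in mind (toy shapes and a hypothetical helper, not the actual PR code):

import torch

def pad_multi_image_batch(per_sample_images):
    # per_sample_images: list (len = batch_size) of tensors shaped (num_images, num_patches, 3, 336, 336)
    max_images = max(x.shape[0] for x in per_sample_images)
    max_patches = max(x.shape[1] for x in per_sample_images)
    pixel_values = torch.zeros(len(per_sample_images), max_images, max_patches, 3, 336, 336)
    for i, x in enumerate(per_sample_images):
        pixel_values[i, : x.shape[0], : x.shape[1]] = x  # everything else stays zero padding
    return pixel_values

# sample 0 has 2 images of 5 patches each, sample 1 has 1 image of 3 patches
pixel_values = pad_multi_image_batch([torch.randn(2, 5, 3, 336, 336), torch.randn(1, 3, 3, 336, 336)])
print(pixel_values.shape)  # torch.Size([2, 2, 5, 3, 336, 336])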

nxphi47 avatar Apr 01 '24 03:04 nxphi47

@NielsRogge is this pad function what you expected? https://github.com/huggingface/transformers/blob/685892cb6c13e0d17043d58f8997011b6fd75823/src/transformers/models/llava_next/image_processing_llava_next.py#L449

nxphi47 avatar Apr 11 '24 07:04 nxphi47

It's in the right direction, but any image processing method should always take a numpy array as input and produce a numpy array. In case of padding, one can leverage the pad method available in the image_transforms module. An example is this method.

Also wondering @amyeroberts why pad was deprecated for this model in favor of pad_image

NielsRogge avatar Apr 11 '24 07:04 NielsRogge

It's in the right direction, but any image processing method should always take a numpy array as input and produce a numpy array. In case of padding, one can leverage the pad method available in the image_transforms module. An example is this method.

Also wondering @amyeroberts why pad was deprecated for this model in favor of pad_image

@NielsRogge Can you be more specific by providing a concrete example for llava_next? How should pad be implemented, given that it should receive a list of np arrays for multiple images to be padded into a single np.array? I couldn't draw a correlation from the Donut or DETR examples.

nxphi47 avatar Apr 11 '24 07:04 nxphi47

Ok I see, in that case we can just call the method _pad_pixel_values I assume. I'll let @amyeroberts confirm.

NielsRogge avatar Apr 11 '24 07:04 NielsRogge

Discussed this offline with @zucchini-nlp: we'd like batched generation for llava-next to be available by the next Transformers release (given that it's the current best open-source VLM along with Idefics-2). @nxphi47, do you think you're able to address the remaining comments and get a green CI (by running make fixup)? Otherwise we're happy to help and get this merged quickly.

NielsRogge avatar Apr 24 '24 19:04 NielsRogge

Also wondering @amyeroberts why pad was deprecated for this model in favor of pad_image

@NielsRogge pad was deprecated across the feature extractors because the API was inconsistent: some pad implementations would pad a single image, some a batch of images, and the input arguments varied as well. There were also inconsistent return types: sometimes an image, sometimes a batch of images, sometimes a BatchFeature. All standard image transforms should follow the pattern of taking and returning a single numpy array image.

To make this more consistent, pad_image is used for a single image. pad is implemented for some image processors in the case where it takes a batch of images, and it calls the pad_image method on each one.
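
A rough numpy-only sketch of that convention (illustrative only, not the actual transformers implementation):

import numpy as np
from typing import List

def pad_image(image: np.ndarray, target_num_patches: int) -> np.ndarray:
    """Pad one image's patch dimension with zeros; takes and returns a single numpy array."""
    extra = target_num_patches - image.shape[0]
    return np.pad(image, ((0, extra), (0, 0), (0, 0), (0, 0)))

def pad(images: List[np.ndarray]) -> np.ndarray:
    """Pad a batch to the largest num_patches by calling pad_image on each image, then stack."""
    target = max(img.shape[0] for img in images)
    return np.stack([pad_image(img, target) for img in images])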

amyeroberts avatar Apr 24 '24 19:04 amyeroberts

@NielsRogge @zucchini-nlp Resolved the issues. Should be ready to merge.

nxphi47 avatar Apr 25 '24 03:04 nxphi47

Thanks, made some final comments, after that I'll assign Amy for approval.

NielsRogge avatar Apr 25 '24 06:04 NielsRogge

Thanks, made some final comments, after that I'll assign Amy for approval.

Added the "remove comments" update

nxphi47 avatar Apr 25 '24 07:04 nxphi47

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Hey guys! @nxphi47 @NielsRogge I just pulled this branch and am facing problems with generation with the repetition_penalty parameter. I'm using the vicuna version of LLaVA.

Generation without penalty works as it should

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from PIL import Image
import requests
import torch

model_path = "llava-hf/llava-v1.6-vicuna-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_path)
processor.tokenizer.padding_side = "left"

model = LlavaNextForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="cuda",
)

cat_img = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
chart_img = Image.open(requests.get("https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true", stream=True).raw)

prompts = [
    'A chat between a curious human and an artificial intelligence assistant. '
    'The assistant gives helpful, detailed, and polite answers to the human\'s questions. '
    'USER: <image>\nWhat is shown in this image? ASSISTANT:'
    for _ in range(2)
]
inputs = processor(prompts, [cat_img, chart_img], return_tensors='pt', padding=True).to(model.device)
output = model.generate(**inputs, max_new_tokens=10, do_sample=False, pad_token_id=processor.tokenizer.pad_token_id)
for o in output:
    print(processor.decode(o, skip_special_tokens=False), '\n')

# output
"""
<s> A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image> 
What is shown in this image? ASSISTANT:
The image shows two cats lying on a 

<s> A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image> 
What is shown in this image? ASSISTANT: The image appears to be a graphical representation of 
"""

Now let's generate the same thing with repetition_penalty. There will be unknown tokens (zeros) in the output:

output = model.generate(**inputs, max_new_tokens=10, do_sample=False, pad_token_id=processor.tokenizer.pad_token_id,
                       repetition_penalty=1.5)
for o in output:
    print(processor.decode(o, skip_special_tokens=False), '\n')
# output
"""
<s> A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image> 
What is shown in this image? ASSISTANT:</s><unk><unk><unk><unk><unk><unk><unk><unk><unk> 

<s> A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image> 
What is shown in this image? ASSISTANT: This appears to be a graphical representation of some 
"""

On the other hand, if we pass the first image twice with the penalty, everything works fine:

inputs = processor(prompts, [cat_img, cat_img], return_tensors='pt', padding=True).to(model.device)
output = model.generate(**inputs, max_new_tokens=10, do_sample=False, pad_token_id=processor.tokenizer.pad_token_id,
                       repetition_penalty=1.5)
for o in output:
    print(processor.decode(o, skip_special_tokens=False), '\n')
# output
"""
<s> A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image> 
What is shown in this image? ASSISTANT: This photo shows two cats lying on their sides 

<s> A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image> 
What is shown in this image? ASSISTANT: This photo shows two cats lying on their sides 
"""

messlav avatar Apr 25 '24 14:04 messlav

@messlav Can you try turning this line back to new_token_positions += new_token_positions[:, -1].max() - new_token_positions[:, -1:]?

https://github.com/huggingface/transformers/blob/a0102a425dc8d01fddf215444aa2e54dfd8b7eb2/src/transformers/models/llava_next/modeling_llava_next.py#L565
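
For context, a toy illustration (made-up values) of what that line computes: each row is shifted so its last position matches the batch max, which is consistent with left padding.

import torch

new_token_positions = torch.tensor([[0, 1, 2, 3],
                                    [0, 1, 2, 2]])  # second row is one token shorter
new_token_positions += new_token_positions[:, -1].max() - new_token_positions[:, -1:]
print(new_token_positions)
# tensor([[0, 1, 2, 3],
#         [1, 2, 3, 3]])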

nxphi47 avatar Apr 25 '24 15:04 nxphi47

@nxphi47 Didn't help, same behaviour :(

messlav avatar Apr 25 '24 19:04 messlav