Expand inputs in processors for VLMs
What does this PR do?
This is a draft PR to show that we can move `_merge_inputs_with_vision_embeds` into the processing logic, thus making VLMs more versatile in terms of generation strategies. Changes are made only in LLaVa-based models, and can be expanded to all VLMs if the general idea is okay. All models were tested locally with different batch sizes and image resolutions; generation is the same as it was before the changes.
The main idea is to get sequence length for image features inside the processing files, and expand input ids by repeating special image token. Same is already done for IDEFICS in transformers. Some questions to be addressed for this PR to work properly:
- We should add `patch_size` to the image processor configs, as we need it to calculate the number of patches. I've seen two models that already have `patch_size` in their configs, so I guess this should not be a problem.
- Not sure about this one: we currently rely on `vision_feature_select_strategy` for the LLaVa model to decide whether to subtract 1 from the image sequence length or not. We can either add it to the config, or hardcode it because we know all LLaVa models use the "default" strategy.
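To make the idea concrete, here is a minimal sketch of the processor-side expansion described above; the function name and exact formula are illustrative, not the final transformers implementation:

```python
def expand_image_tokens(
    text: str,
    image_size: int,
    patch_size: int,
    vision_feature_select_strategy: str,
    image_token: str = "<image>",
) -> str:
    """Repeat the special image token once per image embedding the vision tower will produce."""
    # One patch embedding per (patch_size x patch_size) tile, plus the CLS token.
    num_image_tokens = (image_size // patch_size) ** 2 + 1
    if vision_feature_select_strategy == "default":
        # The "default" strategy drops the CLS embedding, hence the "subtract 1" above.
        num_image_tokens -= 1
    return text.replace(image_token, image_token * num_image_tokens)

# For LLaVa-1.5 (336px images, patch size 14): (336 // 14) ** 2 = 576 image tokens.
expanded = expand_image_tokens(
    "USER: <image>\nWhat is in the image? ASSISTANT:",
    image_size=336,
    patch_size=14,
    vision_feature_select_strategy="default",
)
```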
Looking forward to seeing this expanded to other VLMs! Some might be trickier: PaliGemma, for instance, incorporates the causal mask computation in the merge method (I thought about that while reading), but it makes sense that most of this logic should live in the processor, not the modeling code.
@amyeroberts I did some clean-up after Arthur's comments. Requesting review, should be ready. If this works I will expand the logic to BLIP and PaliGemma in the coming weeks.
What changed:
- The model can generate from both old inputs and new, expanded inputs. If old inputs are passed, a warning is raised asking the user to update the processor config (see the sketch after this list).
- The processor can also return both formats. If it has all the attributes needed to calculate the image embedding length, the inputs are expanded. Otherwise, a warning is raised and the old behavior is retained.
- The old behavior is planned to be removed entirely in v4.44 (or maybe even v4.43?).
- Added tests to check that generation from old and new inputs is identical.
- To actually have LLaVa-based models work in the new style, I'll later update all llava-hf configs on the Hub. Other models on the Hub will continue to work with the old behavior.
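For reference, a minimal sketch of the model-side check on new-format inputs (simplified and illustrative; the real logic lives in the modeling files and the names here are made up). Old-format inputs, with one placeholder per image, instead take the deprecated in-model merge path and trigger the warning:

```python
import torch

def scatter_image_features(
    input_ids: torch.Tensor,       # (batch, seq_len), already expanded by the processor
    inputs_embeds: torch.Tensor,   # (batch, seq_len, hidden)
    image_features: torch.Tensor,  # (num_images, num_patches, hidden)
    image_token_id: int,
) -> torch.Tensor:
    special_image_mask = input_ids == image_token_id
    n_image_tokens = int(special_image_mask.sum())
    n_image_features = image_features.shape[0] * image_features.shape[1]
    if n_image_tokens != n_image_features:
        # This is the kind of check behind the "Image features and image tokens
        # do not match" error discussed later in this thread.
        raise ValueError(
            f"Image features and image tokens do not match: "
            f"tokens: {n_image_tokens}, features {n_image_features}"
        )
    # New-style inputs: each image embedding is written into its placeholder slot.
    mask = special_image_mask.unsqueeze(-1).expand_as(inputs_embeds)
    return inputs_embeds.masked_scatter(mask, image_features.to(inputs_embeds.dtype))
```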
@amyeroberts addressed the comments and added all VLMs to the PR (excluding Idefics, Fuyu and Kosmos as those already have expansion in processing).
- The warning text is clearer, and it's easy for users to add the new attributes to the `Processor` class (with `processor.patch_size = patch_size`).
- BLIP-2 needed more modifications as it didn't have a special image token; lmk if the way I did it works.
- PaliGemma worked out of the box but needed changes for the causal mask. There's also something weird with `position_ids`, which will be fixed by @molbap.
- All models have their "old vs new format equivalence" tests, and they pass locally. I don't know how to make the failing doctest happy; it's still red even after I deprecated the unused attribute.
This should be done; I've addressed the comments. For the failing test, I have no idea how to skip it after deprecating a property from the config.
Alright cool, taking a look soon! For the config option, a quick & dirty solution could be to do something like `_ = config.ignore_index` in the modeling?
I'll run slow tests and check everything is okay, will merge some time next week.
I encountered the warning: "Expanding inputs for image tokens in LLaVa should be done in processing. Please add patch_size and vision_feature_select_strategy to the model's processing config or set directly with processor.patch_size = {{patch_size}} and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}. Using processors without these attributes in the config is deprecated and will throw an error in v4.47."
Could you please guide me on how to choose the patch_size and vision_feature_select_strategy based on the model? Are there any related documents available? @zucchini-nlp
@pspdada hey! The `patch_size` and `vision_feature_select_strategy` should be taken from the model config, i.e. `vision_feature_select_strategy = model.config.vision_feature_select_strategy` and `patch_size = model.config.vision_config.patch_size`.
The message mentions deprecation in v4.47, but we've run into a few things that still need to be done, so the target version will be pushed back by a few more versions. After the issues we encountered are fixed, I'll update the official checkpoints on the Hub. Until then you are welcome to use the old way (and ignore the warning) or try out the new logic by setting the necessary params on the processor :)
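For example, a minimal sketch of setting these attributes directly on an already-loaded processor (the checkpoint name here is just an example):

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # example checkpoint
model = LlavaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Take the values from the model config, as suggested above, and set them
# directly on the processor so that the inputs are expanded in processing.
processor.patch_size = model.config.vision_config.patch_size
processor.vision_feature_select_strategy = model.config.vision_feature_select_strategy
```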
Thank you for your response, but there is an issue.
I use transformers 4.46.0.dev0, installed via `pip install --upgrade git+https://github.com/huggingface/transformers.git`, which means this pull request has already taken effect (it has been merged into the main branch).
When using the following code to load the llava 1.5 model and generate with it:
```python
def _create_v_1_5(self) -> tuple[LlavaForConditionalGeneration, LlavaProcessor]:
    model_name = f"llava-hf/llava-1.5-{self.model_size}-hf"
    model: LlavaForConditionalGeneration = LlavaForConditionalGeneration.from_pretrained(
        model_name,
        cache_dir=self.model_dir,
        torch_dtype=self.torch_dtype,
        device_map="auto",
        low_cpu_mem_usage=True,
        attn_implementation=attn_implementation,
    ).to(self.device).eval()
    processor: LlavaProcessor = LlavaProcessor.from_pretrained(
        model_name,
        cache_dir=self.model_dir,
        padding_side="left",
        vision_feature_select_strategy=model.config.vision_feature_select_strategy,
        patch_size=model.config.vision_config.patch_size,
    )
    print(model.config.vision_feature_select_strategy, model.config.vision_config.patch_size)
    return model, processor
```
```python
def _gen(
    self,
    images: list[Image.Image],
    prompts: list[str],
    max_token: int,
    do_sample: list,
    temp: float,
    eos_token_id: list[int] | None = None,
) -> list:
    with torch.inference_mode():
        inputs = self.processor(
            images=images,
            text=prompts,
            return_tensors="pt",
            return_token_type_ids=False,
            padding=True,
        ).to(self.device, torch.float16)
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=max_token,
            temperature=temp if do_sample else None,
            do_sample=do_sample,
            use_cache=True,
            eos_token_id=eos_token_id,
        )
        decoded_outputs: list[str] = self.processor.batch_decode(
            generated_ids,
            skip_special_tokens=False,
            clean_up_tokenization_spaces=False,
        )
        return decoded_outputs
```
The output is `default 14`, and an error occurs:
File "/root/llm-project/LVLM/model/llava.py", line 202, in _gen
generated_ids = self.model.generate(
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/generation/utils.py", line 2220, in generate
result = self._sample(
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/generation/utils.py", line 3211, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 524, in forward
raise ValueError(
ValueError: Image features and image tokens do not match: tokens: 325440, features 576
If I remove those two lines, everything works fine:
```python
vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,
```
Is this behavior expected and under control, or should I open a new issue about it and provide full information?
I use the following code to decode the output from `model.generate`:
```python
new_sentence = decoded_output.strip('<pad>').strip('<s>').strip()[len(prompt) + 1:].strip('</s>').strip()
```
An interesting phenomenon is that the original output was:
The image features a man and a woman standing close to each other, posing for a picture.
However, after adding these two lines:
```python
vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,
```
the output becomes particularly strange:
...ge><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image>
Please describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture.
@pspdada I just opened a PR for that (https://github.com/huggingface/transformers/pull/34332). It is a totally unrelated issue and should be solved soon
Thank you for your attention, and I wish for a good outcome.
I've discovered a new issue after the merge of #34332. In the latest version, transformers==4.46.2, problems still occur when I set:
```python
vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,
```
After setting these parameters, the result of batch inference (without truncation, directly using batch_decode output) is as follows:
['<s> USER: <image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><
image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image> \nPlease describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture. The man is wearing a tie, and the woman is wearing a white shirt. They are both smiling and enjoying the moment.\n\nIn the background, there is a TV mounted on the wall, and a few bottles can be seen placed around the room. There are also two other people in the scene, one on the left side and another on the right side of the image.</s>',
'<pad><pad><pad><pad><pad><pad><pad><s> USER: <image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><i
mage><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image> \nWhat is in the image? ASSISTANT: Theo, the image\n\nWhat\n\n</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>']
However, when I remove these two lines, everything works as expected, and the result is:
['<s> USER: <image> \nPlease describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture. The man is wearing a tie, and the woman is wearing a white shirt. They are both smiling and enjoying the moment.\n\nIn the background, there is a TV mounted on the wall, and a few bottles can be seen placed around the room. There are also two other people in the scene, one on the left side and another on the right side of the image.</s>',
'<pad><pad><pad><pad><pad><pad><pad><s> USER: <image> \nWhat is in the image? ASSISTANT: The image features a group of birds perched on a tree branch.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>']
Shall I open a new issue about it and provide full information? @zucchini-nlp
@pspdada that is the expected logic for the new processing code, as we expand the text with as many `<image>` tokens as there will be image embeddings. When decoding the text you should pass `skip_special_tokens=True` to remove all the image tokens.
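For example, reusing the `processor` and `generated_ids` from the `_gen` method above:

```python
decoded_outputs = self.processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,  # drops <image>, <pad>, <s>, </s> from the decoded text
    clean_up_tokenization_spaces=False,
)
```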
@zucchini-nlp But the output of the model still doesn't seem right this way. Look at my previous example; if we remove the `<image>` and `<pad>` tokens, what remains is:
\nWhat is in the image? ASSISTANT: Theo, the image\n\nWhat\n\n</s>
If I don't set those two lines, it is:
\nWhat is in the image? ASSISTANT: The image features a group of birds perched on a tree branch.</s>
@pspdada so the new logic generates gibberish when the inputs are batched? I don't see such behavior with the latest release. Can you verify that the padding side is set to left, as recommended in the docs (`processor.tokenizer.padding_side = "left"`)?