Expand inputs in processors for VLMs
What does this PR do?
This is a draft PR to show that we can move `_merge_inputs_with_vision_embeds` into the processing logic, thus making VLMs more versatile in terms of generation strategies. Changes are made only in LLaVa-based models, and can be expanded to all VLMs if the general idea is okay. All models were tested locally with different batch sizes and image resolutions; generation is the same as it was before the changes.
The main idea is to get sequence length for image features inside the processing files, and expand input ids by repeating special image token. Same is already done for IDEFICS in transformers. Some questions to be addressed for this PR to work properly:
- We should add `patch_size` to the image processor configs, as we need it to calculate the number of patches. I've seen two models that already have `patch_size` in their configs, so I guess this should not be a problem.
- Not sure about this one: we currently rely on `vision_feature_select_strategy` for the LLaVa model to decide whether to subtract 1 from the image sequence length or not. We can either add it to the config, or hardcode it because we know all LLaVa models use the "default" strategy.
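To make the idea concrete, here is a minimal sketch of the processor-side expansion described above; the function name and exact formula are illustrative, not the final transformers implementation:

```python
def expand_image_tokens(
    text: str,
    image_size: int,
    patch_size: int,
    vision_feature_select_strategy: str,
    image_token: str = "<image>",
) -> str:
    """Repeat the special image token once per image embedding the vision tower will produce."""
    # One patch embedding per (patch_size x patch_size) tile, plus the CLS token.
    num_image_tokens = (image_size // patch_size) ** 2 + 1
    if vision_feature_select_strategy == "default":
        # The "default" strategy drops the CLS embedding, hence the "subtract 1" above.
        num_image_tokens -= 1
    return text.replace(image_token, image_token * num_image_tokens)

# For LLaVa-1.5 (336px images, patch size 14): (336 // 14) ** 2 = 576 image tokens.
expanded = expand_image_tokens(
    "USER: <image>\nWhat is in the image? ASSISTANT:",
    image_size=336,
    patch_size=14,
    vision_feature_select_strategy="default",
)
```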
Looking forward to seeing this expanded to other VLMs! Some might be trickier: PaliGemma, for instance, incorporates the causal mask computation in the merge method (I thought about that while reading), but it makes sense that most of this logic should live in the processor, not the modeling code.
@amyeroberts I did some clean-up after Arthur's comments. Requesting review, should be ready. If this works I will expand the logic to BLIP and PaliGemma in the coming weeks.
What changed:
- The model can generate from both old inputs and new, expanded inputs. If old inputs are passed, a warning is raised asking the user to update the processor config (see the sketch after this list).
- The processor can also return both formats. If it has all the attributes needed to calculate the image embedding length, the inputs are expanded. Otherwise, a warning is raised and the old behavior is retained.
- The old behavior is planned to be removed entirely in v4.44 (or maybe even v4.43?).
- Added tests to check that generation from old and new inputs is identical.
- To actually have LLaVa-based models work in the new style, I'll later update all llava-hf configs on the Hub. Other models on the Hub will continue to work with the old behavior.
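For reference, a minimal sketch of the model-side check on new-format inputs (simplified and illustrative; the real logic lives in the modeling files and the names here are made up). Old-format inputs, with one placeholder per image, instead take the deprecated in-model merge path and trigger the warning:

```python
import torch

def scatter_image_features(
    input_ids: torch.Tensor,       # (batch, seq_len), already expanded by the processor
    inputs_embeds: torch.Tensor,   # (batch, seq_len, hidden)
    image_features: torch.Tensor,  # (num_images, num_patches, hidden)
    image_token_id: int,
) -> torch.Tensor:
    special_image_mask = input_ids == image_token_id
    n_image_tokens = int(special_image_mask.sum())
    n_image_features = image_features.shape[0] * image_features.shape[1]
    if n_image_tokens != n_image_features:
        # This is the kind of check behind the "Image features and image tokens
        # do not match" error discussed later in this thread.
        raise ValueError(
            f"Image features and image tokens do not match: "
            f"tokens: {n_image_tokens}, features {n_image_features}"
        )
    # New-style inputs: each image embedding is written into its placeholder slot.
    mask = special_image_mask.unsqueeze(-1).expand_as(inputs_embeds)
    return inputs_embeds.masked_scatter(mask, image_features.to(inputs_embeds.dtype))
```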
@amyeroberts addressed the comments and added all VLMs to the PR (excluding Idefics, Fuyu and Kosmos as those already have expansion in processing).
- The warning text is clearer, and it's easy for users to add the new attributes to the `Processor` class (with `processor.patch_size = patch_size`).
- BLIP-2 needed more modifications as it didn't have a special image token; lmk if the way I did it works.
- PaliGemma worked out of the box but needed changes for the causal mask. There's also something weird with `position_ids`, which will be fixed by @molbap.
- All models have their "old vs new format equivalence" tests, and they pass locally. I don't know how to make the failing doctest happy; it's still red even after I deprecated the unused attribute.
This should be done; I've addressed the comments. For the failing test, I have no idea how to skip it after deprecating a property from the config.
Alright cool, taking a look soon! For the config option, a quick & dirty solution could be to do something like `_ = config.ignore_index` in the modeling?
I'll run slow tests and check everything is okay, will merge some time next week.
I encountered the warning: "Expanding inputs for image tokens in LLaVa should be done in processing. Please add patch_size and vision_feature_select_strategy to the model's processing config or set directly with processor.patch_size = {{patch_size}} and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}. Using processors without these attributes in the config is deprecated and will throw an error in v4.47."
Could you please guide me on how to choose the patch_size and vision_feature_select_strategy based on the model? Are there any related documents available? @zucchini-nlp
@pspdada hey! The `patch_size` and `vision_feature_select_strategy` should be taken from the model config, i.e. `vision_feature_select_strategy = model.config.vision_feature_select_strategy` and `patch_size = model.config.vision_config.patch_size`.
The message mentions deprecation in v4.47, but we've run into a few things that still need to be done, so the target version will be pushed back by a few more versions. After the issues we encountered are fixed, I'll update the official checkpoints on the Hub. Until then you are welcome to use the old way (and ignore the warning) or try out the new logic by setting the necessary params on the processor :)
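For example, a minimal sketch of setting these attributes directly on an already-loaded processor (the checkpoint name here is just an example):

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # example checkpoint
model = LlavaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Take the values from the model config, as suggested above, and set them
# directly on the processor so that the inputs are expanded in processing.
processor.patch_size = model.config.vision_config.patch_size
processor.vision_feature_select_strategy = model.config.vision_feature_select_strategy
```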
Thank you for your response, but there is an issue.
I use transformers 4.46.0.dev0, installed via `pip install --upgrade git+https://github.com/huggingface/transformers.git`, which means this pull request has already taken effect (it has been merged into the main branch).
When using the following code to load the llava 1.5 model and generate with it:
```python
def _create_v_1_5(self) -> tuple[LlavaForConditionalGeneration, LlavaProcessor]:
    model_name = f"llava-hf/llava-1.5-{self.model_size}-hf"
    model: LlavaForConditionalGeneration = LlavaForConditionalGeneration.from_pretrained(
        model_name,
        cache_dir=self.model_dir,
        torch_dtype=self.torch_dtype,
        device_map="auto",
        low_cpu_mem_usage=True,
        attn_implementation=attn_implementation,
    ).to(self.device).eval()
    processor: LlavaProcessor = LlavaProcessor.from_pretrained(
        model_name,
        cache_dir=self.model_dir,
        padding_side="left",
        vision_feature_select_strategy=model.config.vision_feature_select_strategy,
        patch_size=model.config.vision_config.patch_size,
    )
    print(model.config.vision_feature_select_strategy, model.config.vision_config.patch_size)
    return model, processor
```
```python
def _gen(
    self,
    images: list[Image.Image],
    prompts: list[str],
    max_token: int,
    do_sample: list,
    temp: float,
    eos_token_id: list[int] | None = None,
) -> list:
    with torch.inference_mode():
        inputs = self.processor(
            images=images,
            text=prompts,
            return_tensors="pt",
            return_token_type_ids=False,
            padding=True,
        ).to(self.device, torch.float16)
        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=max_token,
            temperature=temp if do_sample else None,
            do_sample=do_sample,
            use_cache=True,
            eos_token_id=eos_token_id,
        )
        decoded_outputs: list[str] = self.processor.batch_decode(
            generated_ids,
            skip_special_tokens=False,
            clean_up_tokenization_spaces=False,
        )
        return decoded_outputs
```
The output is `default 14`, and an error occurs:
File "/root/llm-project/LVLM/model/llava.py", line 202, in _gen
generated_ids = self.model.generate(
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/generation/utils.py", line 2220, in generate
result = self._sample(
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/generation/utils.py", line 3211, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/LVLM/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 524, in forward
raise ValueError(
ValueError: Image features and image tokens do not match: tokens: 325440, features 576
If I remove those two lines, everything works fine:
```python
vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,
```
Is this behavior expected and under control, or should I open a new issue about it and provide full information?
I use the following code to decode the output from `model.generate`:
```python
new_sentence = decoded_output.strip('<pad>').strip('<s>').strip()[len(prompt) + 1:].strip('</s>').strip()
```
An interesting phenomenon is that the original output was:
The image features a man and a woman standing close to each other, posing for a picture.
However, after adding these two lines:
```python
vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,
```
the output becomes particularly strange:
...ge><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image>
Please describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture.
@pspdada I just opened a PR for that (https://github.com/huggingface/transformers/pull/34332). It is a totally unrelated issue and should be solved soon
Thank you for your attention, and I wish for a good outcome.
I've discovered a new issue after the merge of #34332. In the latest version, transformers==4.46.2, problems still occur when I set:
```python
vision_feature_select_strategy=model.config.vision_feature_select_strategy,
patch_size=model.config.vision_config.patch_size,
```
After setting these parameters, the result of batch inference (without truncation, directly using batch_decode output) is as follows:
['<s> USER: <image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><
image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image> \nPlease describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture. The man is wearing a tie, and the woman is wearing a white shirt. They are both smiling and enjoying the moment.\n\nIn the background, there is a TV mounted on the wall, and a few bottles can be seen placed around the room. There are also two other people in the scene, one on the left side and another on the right side of the image.</s>',
'<pad><pad><pad><pad><pad><pad><pad><s> USER: <image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><i
mage><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image><image> \nWhat is in the image? ASSISTANT: Theo, the image\n\nWhat\n\n</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>']
However, when I remove these two lines, everything works as expected, and the result is:
['<s> USER: <image> \nPlease describe this image in detail and tell me what you see. ASSISTANT: The image features a man and a woman standing close to each other, posing for a picture. The man is wearing a tie, and the woman is wearing a white shirt. They are both smiling and enjoying the moment.\n\nIn the background, there is a TV mounted on the wall, and a few bottles can be seen placed around the room. There are also two other people in the scene, one on the left side and another on the right side of the image.</s>',
'<pad><pad><pad><pad><pad><pad><pad><s> USER: <image> \nWhat is in the image? ASSISTANT: The image features a group of birds perched on a tree branch.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>']
Shall I open a new issue about it and provide full information? @zucchini-nlp
@pspdada that is the expected logic for the new processing code, as we expand the text with as many `<image>` tokens as there will be image embeddings. When decoding the text you should pass `skip_special_tokens=True` to remove all the image tokens.
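For example, reusing the `processor` and `generated_ids` from the `_gen` method above:

```python
decoded_outputs = self.processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,  # drops <image>, <pad>, <s>, </s> from the decoded text
    clean_up_tokenization_spaces=False,
)
```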
@zucchini-nlp But the output of the model still doesn't seem right this way. Look at my previous example; if we remove the `<image>` and `<pad>` tokens, what remains is:
\nWhat is in the image? ASSISTANT: Theo, the image\n\nWhat\n\n</s>
If I don't set those two lines, it is:
\nWhat is in the image? ASSISTANT: The image features a group of birds perched on a tree branch.</s>
@pspdada so the new logic generates gibberish when the inputs are batched? I don't see such behavior with the latest release. Can you verify that the padding side is set to left, as recommended in the docs (`processor.tokenizer.padding_side = "left"`)?