Add image text to text pipeline
What does this PR do?
Add image-text-to-text pipeline!
A split of this PR containing only the model-specific pre- and post-processing changes is available here, in order to reduce the line count and the number of files changed before merging this PR.
Note: a "legacy" kwarg is needed here to modify the preprocessing of some image-text-to-text models if we want to integrate them into this pipeline. However, the way it is handled might not be ideal, so I'm open to suggestions on how to improve it.
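For context, the current handling amounts to the try/except fallback below. This is a simplified paraphrase of the pipeline's `preprocess` (the same logic is visible in the traceback later in this thread), not the exact code:

```python
def call_processor_with_legacy_fallback(processor, images, text, framework, torch_dtype, **kwargs):
    """Call the processor; drop the "legacy" kwarg for processors that reject it."""
    try:
        return processor(images=images, text=text, return_tensors=framework, **kwargs).to(dtype=torch_dtype)
    except TypeError:
        # Processors that don't accept "legacy" raise a TypeError; retry without it.
        kwargs.pop("legacy", None)
        return processor(images=images, text=text, return_tensors=framework, **kwargs).to(dtype=torch_dtype)
```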
The pipeline supports the following inputs:
- unbatched images and text: `images=image, text=text`
- batched images and text: `images=[image, image], text=[text, text]`
- several images per prompt (only for models supporting the use of an image token): `images=[[image, image], [image]]` or `images=[image, image, image]`, with `text=["... <image>...<image>...", "...<image>..."]`
- chat templates (for models supporting them); see the call sketch after this list
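A minimal sketch of these calling conventions, using the same checkpoint and fixture image as the examples below (the prompts themselves are illustrative):

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
img = "./tests/fixtures/tests_samples/COCO/000000039769.png"

# Unbatched image and text
pipe(images=img, text="<image> What is this?")

# Batched images and text (one image per prompt)
pipe(images=[img, img], text=["<image> What is this?", "<image> Describe this image."])

# Several images per prompt, nested or flat (models with an image token only)
pipe(images=[[img, img], [img]], text=["<image><image> Compare these.", "<image> What is this?"])
pipe(images=[img, img, img], text=["<image><image> Compare these.", "<image> What is this?"])
```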
TODOs:
- [x] Add pipeline tests in model-specific test files
- [ ] Update tasks documentation?
Known current limitations/bugs:
- Using prompts without image tokens with models that expect them will throw an error. Should we automatically add image tokens to prompts and display a warning? For now, only a warning is displayed.
- Using several images per prompt with models that do not support the use of an image token will raise an uncaught error.
- Donut doesn't work, as there is a problem identifying the correct model type for it
- Idefics3 will raise an uncaught error if no correct image tokens are provided
- Pixtral with batched input raises: `Pipeline with tokenizer without pad_token cannot do batching. You can try to set it with pipe.tokenizer.pad_token_id = model.config.eos_token_id.` See the workaround sketch after this list.
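A minimal workaround sketch for that Pixtral batching error, based only on the error message above; the checkpoint name is an assumption:

```python
from transformers import pipeline

# Hypothetical checkpoint name; substitute the Pixtral checkpoint you are testing.
pipe = pipeline("image-text-to-text", model="mistral-community/pixtral-12b")

# Batching needs a pad token; fall back to EOS as the error message suggests.
if pipe.tokenizer.pad_token_id is None:
    pipe.tokenizer.pad_token_id = pipe.model.config.eos_token_id
```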
Examples of usage:
```python
>>> from transformers import pipeline
>>> pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
>>> image = "./tests/fixtures/tests_samples/COCO/000000039769.png"
>>> text = "<image> What this is? Assistant: This is"
>>> pipe(image, text=text, max_new_tokens=20)
[
    [
        {
            "input_text": "<image> What this is? Assistant: This is",
            "generated_text": "<image> What this is? Assistant: This is a photo of two cats lying on a pink blanket. The cats are sleeping and appear to be comfortable",
        }
    ]
]
```
```python
>>> from transformers import pipeline
>>> pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
...             },
...             {"type": "text", "text": "Describe this image."},
...         ],
...     }
... ]
>>> outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
>>> print(outputs[0]["generated_text"])
"In the image, a woman is sitting on the sandy beach, her legs crossed in a relaxed manner"
```
```python
>>> from transformers import pipeline
>>> pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
...             },
...             {"type": "text", "text": "Describe this image."},
...         ],
...     },
...     {
...         "role": "assistant",
...         "content": [
...             {"type": "text", "text": "There is a dog and"},
...         ],
...     },
... ]
>>> outputs = pipe(text=messages, max_new_tokens=20)
>>> print(outputs[0]["generated_text"])
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "There is a dog and a person in the image. The dog is sitting on the sand, and the person is sitting on",
            }
        ],
    },
]
```
Who can review?
@Rocketknight1 @molbap @qubvel @NielsRogge
Will it be possible to use this PR for just text generation with an image-capable model? I'm trying to use this PR (at commit 4ac2d1fce81a00d251ae9af75f32b1f821d56296) with meta-llama/Llama-3.2-90B-Vision-Instruct so that I can compare its language capabilities against Llama 3.1 70B, and I don't need the image support.
I tried calling it like this:
```python
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    device_map="auto",
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is 1+1?"},
        ],
    }
]
outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
print(outputs[0]["generated_text"])
```
That resulted in this error:
```
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:393, in ImageTextToTextPipeline.preprocess(self, inputs, truncation, padding, max_length, timeout, continue_final_message)
392 try:
--> 393 model_inputs = self.processor(images=images, text=text, return_tensors=self.framework, **kwargs).to(
394 dtype=self.torch_dtype
395 )
396 except TypeError:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/processing_mllama.py:285, in MllamaProcessor.__call__(self, images, text, audio, videos, **kwargs)
284 _ = text_kwargs.pop("padding_side", None) # hack until padding-side is an accepted kwarg by tokenizers
--> 285 encoding = self.tokenizer(text, **text_kwargs)
286 data.update(encoding)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3020, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
3019 self._switch_to_input_mode()
-> 3020 encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
3021 if text_target is not None:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3108, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, split_special_tokens, **kwargs)
3107 batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
-> 3108 return self.batch_encode_plus(
3109 batch_text_or_text_pairs=batch_text_or_text_pairs,
3110 add_special_tokens=add_special_tokens,
3111 padding=padding,
3112 truncation=truncation,
3113 max_length=max_length,
3114 stride=stride,
3115 is_split_into_words=is_split_into_words,
3116 pad_to_multiple_of=pad_to_multiple_of,
3117 padding_side=padding_side,
3118 return_tensors=return_tensors,
3119 return_token_type_ids=return_token_type_ids,
3120 return_attention_mask=return_attention_mask,
3121 return_overflowing_tokens=return_overflowing_tokens,
3122 return_special_tokens_mask=return_special_tokens_mask,
3123 return_offsets_mapping=return_offsets_mapping,
3124 return_length=return_length,
3125 verbose=verbose,
3126 split_special_tokens=split_special_tokens,
3127 **kwargs,
3128 )
3129 else:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3310, in PreTrainedTokenizerBase.batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, split_special_tokens, **kwargs)
3301 padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
3302 padding=padding,
3303 truncation=truncation,
(...)
3307 **kwargs,
3308 )
-> 3310 return self._batch_encode_plus(
3311 batch_text_or_text_pairs=batch_text_or_text_pairs,
3312 add_special_tokens=add_special_tokens,
3313 padding_strategy=padding_strategy,
3314 truncation_strategy=truncation_strategy,
3315 max_length=max_length,
3316 stride=stride,
3317 is_split_into_words=is_split_into_words,
3318 pad_to_multiple_of=pad_to_multiple_of,
3319 padding_side=padding_side,
3320 return_tensors=return_tensors,
3321 return_token_type_ids=return_token_type_ids,
3322 return_attention_mask=return_attention_mask,
3323 return_overflowing_tokens=return_overflowing_tokens,
3324 return_special_tokens_mask=return_special_tokens_mask,
3325 return_offsets_mapping=return_offsets_mapping,
3326 return_length=return_length,
3327 verbose=verbose,
3328 split_special_tokens=split_special_tokens,
3329 **kwargs,
3330 )
TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'legacy'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[5], line 9
1 messages = [
2 {
3 "role": "user",
(...)
7 }
8 ]
----> 9 outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
10 print(outputs[0]["generated_text"])
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:291, in ImageTextToTextPipeline.__call__(self, images, text, **kwargs)
285 if isinstance(text, (list, tuple, KeyDataset) if is_torch_available() else (list, tuple)) and isinstance(
286 text[0], (list, tuple, dict)
287 ):
288 # We have one or more prompts in list-of-dicts format, so this is chat mode
290 if isinstance(text[0], dict):
--> 291 return super().__call__(Chat(text, images), **kwargs)
292 else:
293 if images is None:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1302, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
1294 return next(
1295 iter(
1296 self.get_iterator(
(...)
1299 )
1300 )
1301 else:
-> 1302 return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1308, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
1307 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
-> 1308 model_inputs = self.preprocess(inputs, **preprocess_params)
1309 model_outputs = self.forward(model_inputs, **forward_params)
1310 outputs = self.postprocess(model_outputs, **postprocess_params)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:398, in ImageTextToTextPipeline.preprocess(self, inputs, truncation, padding, max_length, timeout, continue_final_message)
396 except TypeError:
397 kwargs.pop("legacy", None)
--> 398 model_inputs = self.processor(images=images, text=text, return_tensors=self.framework, **kwargs).to(
399 dtype=self.torch_dtype
400 )
402 model_inputs["text"] = inputs_text
404 return model_inputs
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/processing_mllama.py:290, in MllamaProcessor.__call__(self, images, text, audio, videos, **kwargs)
288 n_images_in_images = [0]
289 if images is not None:
--> 290 images = make_list_of_images(images)
291 n_images_in_images = [len(sample) for sample in images]
293 if text is not None:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/image_processing_mllama.py:543, in make_list_of_images(images)
541 output_images = images
542 else:
--> 543 raise ValueError(
544 "Invalid input type. Must be a single image, a list of images, or a list of batches of images."
545 )
546 return output_images
ValueError: Invalid input type. Must be a single image, a list of images, or a list of batches of images.
```
I also tried running it just as above but with an image input, and that resulted in an OutOfMemoryError, which is confusing because the model is only 166 GB on disk and I'm running this in a 4x80 GB (i.e. 320 GB) H100 Lambda Labs environment.
```
---------------------------------------------------------------------------
OutOfMemoryError Traceback (most recent call last)
Cell In[6], line 23
1 # messages = [
2 # {
3 # "role": "user",
(...)
9 # outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
10 # print(outputs[0]["generated_text"])
11 messages = [
12 {
13 "role": "user",
(...)
21 }
22 ]
---> 23 outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
24 print(outputs[0]["generated_text"])
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:291, in ImageTextToTextPipeline.__call__(self, images, text, **kwargs)
285 if isinstance(text, (list, tuple, KeyDataset) if is_torch_available() else (list, tuple)) and isinstance(
286 text[0], (list, tuple, dict)
287 ):
288 # We have one or more prompts in list-of-dicts format, so this is chat mode
290 if isinstance(text[0], dict):
--> 291 return super().__call__(Chat(text, images), **kwargs)
292 else:
293 if images is None:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1302, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
1294 return next(
1295 iter(
1296 self.get_iterator(
(...)
1299 )
1300 )
1301 else:
-> 1302 return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1309, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
1307 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
1308 model_inputs = self.preprocess(inputs, **preprocess_params)
-> 1309 model_outputs = self.forward(model_inputs, **forward_params)
1310 outputs = self.postprocess(model_outputs, **postprocess_params)
1311 return outputs
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1209, in Pipeline.forward(self, model_inputs, **forward_params)
1207 with inference_context():
1208 model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
-> 1209 model_outputs = self._forward(model_inputs, **forward_params)
1210 model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
1211 else:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:412, in ImageTextToTextPipeline._forward(self, model_inputs, generate_kwargs)
408 prompt_text = model_inputs.pop("text")
409 input_ids = (
410 model_inputs["input_ids"] if "input_ids" in model_inputs else model_inputs["decoder_input_ids"]
411 ) # for decoder-only models
--> 412 generated_sequence = self.model.generate(**model_inputs, **generate_kwargs)
414 return {"generated_sequence": generated_sequence, "prompt_text": prompt_text, "input_ids": input_ids}
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/generation/utils.py:2208, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
2200 input_ids, model_kwargs = self._expand_inputs_for_generation(
2201 input_ids=input_ids,
2202 expand_size=generation_config.num_return_sequences,
2203 is_encoder_decoder=self.config.is_encoder_decoder,
2204 **model_kwargs,
2205 )
2207 # 12. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 2208 result = self._sample(
2209 input_ids,
2210 logits_processor=prepared_logits_processor,
2211 stopping_criteria=prepared_stopping_criteria,
2212 generation_config=generation_config,
2213 synced_gpus=synced_gpus,
2214 streamer=streamer,
2215 **model_kwargs,
2216 )
2218 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
2219 # 11. prepare beam search scorer
2220 beam_scorer = BeamSearchScorer(
2221 batch_size=batch_size,
2222 num_beams=generation_config.num_beams,
(...)
2227 max_length=generation_config.max_length,
2228 )
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/generation/utils.py:3176, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs)
3173 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {})
3175 # forward pass to get next token
-> 3176 outputs = self(**model_inputs, return_dict=True)
3178 # synced_gpus: don't waste resources running the code we don't need; kwargs must be updated before skipping
3179 model_kwargs = self._update_model_kwargs_for_generation(
3180 outputs,
3181 model_kwargs,
3182 is_encoder_decoder=self.config.is_encoder_decoder,
3183 )
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
168 output = module._old_forward(*args, **kwargs)
169 else:
--> 170 output = module._old_forward(*args, **kwargs)
171 return module._hf_hook.post_forward(module, output)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/modeling_mllama.py:2138, in MllamaForConditionalGeneration.forward(self, input_ids, pixel_values, aspect_ratio_mask, aspect_ratio_ids, attention_mask, cross_attention_mask, cross_attention_states, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep)
2135 cross_attention_mask = cross_attention_mask[:, :, cache_position]
2136 full_text_row_masked_out_mask = full_text_row_masked_out_mask[:, :, cache_position]
-> 2138 outputs = self.language_model(
2139 input_ids=input_ids,
2140 attention_mask=attention_mask,
2141 position_ids=position_ids,
2142 cross_attention_states=cross_attention_states,
2143 cross_attention_mask=cross_attention_mask,
2144 full_text_row_masked_out_mask=full_text_row_masked_out_mask,
2145 past_key_values=past_key_values,
2146 use_cache=use_cache,
2147 inputs_embeds=inputs_embeds,
2148 labels=labels,
2149 output_hidden_states=output_hidden_states,
2150 output_attentions=output_attentions,
2151 return_dict=return_dict,
2152 cache_position=cache_position,
2153 num_logits_to_keep=num_logits_to_keep,
2154 )
2156 return outputs
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/modeling_mllama.py:1948, in MllamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, cross_attention_states, cross_attention_mask, full_text_row_masked_out_mask, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep)
1931 outputs = self.model(
1932 input_ids=input_ids,
1933 cross_attention_states=cross_attention_states,
(...)
1944 cache_position=cache_position,
1945 )
1947 hidden_states = outputs[0]
-> 1948 logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()
1950 loss = None
1951 if labels is not None:
1952 # Upcast to float if we need to compute the loss to avoid potential precision issues
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
164 def new_forward(module, *args, **kwargs):
--> 165 args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
166 if module._hf_hook.no_grad:
167 with torch.no_grad():
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:355, in AlignDevicesHook.pre_forward(self, module, *args, **kwargs)
347 if (
348 value is not None
349 and self.tied_params_map is not None
350 and value.data_ptr() in self.tied_params_map
351 and self.execution_device not in self.tied_params_map[value.data_ptr()]
352 ):
353 self.tied_pointers_to_remove.add((value.data_ptr(), self.execution_device))
--> 355 set_module_tensor_to_device(
356 module,
357 name,
358 self.execution_device,
359 value=value,
360 fp16_statistics=fp16_statistics,
361 tied_params_map=self.tied_params_map,
362 )
364 return send_to_device(args, self.execution_device), send_to_device(
365 kwargs, self.execution_device, skip_keys=self.skip_keys
366 )
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/utils/modeling.py:329, in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
327 module._parameters[tensor_name] = param_cls(new_value, requires_grad=old_value.requires_grad)
328 elif isinstance(value, torch.Tensor):
--> 329 new_value = value.to(device)
330 else:
331 new_value = torch.tensor(value, device=device)
OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU 0 has a total capacity of 79.10 GiB of which 2.12 GiB is free. Including non-PyTorch memory, this process has 76.97 GiB memory in use. Of the allocated memory 75.56 GiB is allocated by PyTorch, and 761.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
Thanks for the feedback @knkski! Although it's not really an objective of this pipeline, I think we can try to add support and at least raise a warning, wdyt @Rocketknight1? As for the memory problem, that is strange indeed; I will look into it, and if others have an idea of why this is happening, feel free to chime in. Do you manage to use this model on your setup without using the pipeline?
@yonigozlan I think that's okay! It might result in a bit of crossover with text-generation pipelines, but I think it's fine, and we can deprecate it later and officially move that functionality to text-generation if it's a problem.
@Rocketknight1 @knkski , text-only inference should be supported now :)
@yonigozlan Thanks! Works great for me :rocket:
I think the extra memory usage is unrelated to this PR, so ignore that :+1:
That might mean we should take these changes and fold them into text-generation instead. However, that might add additional inputs that would make it harder to synchronize the pipeline with the inference spec - cc @Wauplin / @LysandreJik, how annoying do you think that would be?
X-posting the slack thread (private) about that convo.
IMO better to have both text-generation and image-text-to-text to be consistent with https://huggingface.co/tasks.
There are still some issues with the pipeline tests:
- It seems that pipeline model tests are based on "tiny models" available on hf-internal-testing, but those tiny models don't seem to be added anymore for recent VLMs, so those models are not being tested. I'm not sure whether this is (or used to be) an automatic or a manual process, and whether we should start adding those tiny models again.
- The Kosmos2 tiny model causes some problems: its configuration has hyper-parameters that are not compatible with each other. Namely, `latent_query_num=3`, which is a model parameter, should be the same as `num_image_tokens=64`, which is a processor call argument and so can't be set via a JSON config file (I think?). An easy fix would be to manually change `latent_query_num` to 64 in the tiny model's config on hf-internal-testing, but that could make the model not so tiny anymore. Or we could skip the test altogether. See the sketch after this list.
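For illustration, the mismatch boils down to something like the sketch below. It is based only on the parameter names mentioned above, and the checkpoint name is hypothetical:

```python
import requests
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

repo = "hf-internal-testing/tiny-random-Kosmos2"  # hypothetical tiny checkpoint
model = AutoModelForVision2Seq.from_pretrained(repo)
processor = AutoProcessor.from_pretrained(repo)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The processor inserts `num_image_tokens` image placeholder tokens into the
# text (a __call__ argument, default 64), while the model produces
# `config.latent_query_num` image embeddings (3 in the tiny config). The two
# counts must agree, so pass the model's value explicitly:
inputs = processor(
    images=image,
    text="<grounding> An image of",
    num_image_tokens=model.config.latent_query_num,
)
```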
@yonigozlan tiny models aren't automatically generated, those are all manually created. Rather than modifying an existing one (which might break existing tests), I'd suggest just making a new tiny model that fits what you want to test and uploading that to hf-internal-testing. You can ask to be added to the organization if you don't have permissions!
> @yonigozlan tiny models aren't automatically generated, those are all manually created. Rather than modifying an existing one (which might break existing tests), I'd suggest just making a new tiny model that fits what you want to test and uploading that to hf-internal-testing. You can ask to be added to the organization if you don't have permissions!
I see, thanks for the explanation! As for adding a new tiny model, pipelines use the `tiny_model_summary.json` file to identify tiny models, but it looks like only one tiny model per model architecture can be present in that file, so I'm not sure how to solve the issue with the Kosmos2 tiny model without modifying the current one.
@yonigozlan probably the easiest thing to do, in that case, is just to manually upload a new model, don't add it to tiny_model_summary, and manually set that model in the image-text-to-text tests. You shouldn't need to worry about whatever's in tiny_model_summary.json either way!
Also, I was wrong - some of the tiny models are automatically created, but in this case I think a manual one just for your pipeline will work a lot better.
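A rough sketch of what pinning such a manually uploaded tiny model in the pipeline tests could look like; every checkpoint name here is hypothetical:

```python
import unittest

from transformers import pipeline
from transformers.testing_utils import require_torch


@require_torch
class Kosmos2TinyPipelineTest(unittest.TestCase):
    def test_image_text_to_text(self):
        # Hypothetical manually uploaded tiny checkpoint, deliberately not
        # listed in tiny_model_summary.json.
        pipe = pipeline(
            "image-text-to-text",
            model="hf-internal-testing/tiny-random-kosmos2-pipeline",
        )
        outputs = pipe(
            images="./tests/fixtures/tests_samples/COCO/000000039769.png",
            text="<grounding> An image of",
            max_new_tokens=5,
        )
        self.assertIsInstance(outputs, list)
```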
@ArthurZucker Addressed your comments! I tried to simplify the code a bit without removing any logic. Happy to remove some logic depending on what you think about this https://github.com/huggingface/transformers/pull/34170#discussion_r1816953666
Answered on the issue. I think we should simplify to ship with no checks, and add checks in the corresponding processors.
Clearly agree that we should rely on the processors. I can turn all this logic into a util or a `ProcessorMixin` function and adapt it a bit for models requiring some specific input formatting. Happy to work on that. I do think, however, that this will take quite a bit of time to validate and to make sure we handle BC correctly for all processors, and it seems there's a need for this pipeline to be shipped quickly (cc @NielsRogge on that). So I feel we can either:
- Ship the pipeline with no checks and the processors as they are now, with an added warning that each model may have some specific input requirements. Progressively improve the processors afterwards. (That's maybe what you're saying? Just want to confirm.)
- Keep the checks in the pipeline for now and remove them once the processors are in a good state.
- Modify the processors first and delay shipping this pipeline.
@ArthurZucker I removed most of the preprocessing logic, hopefully the code should be less messy now :)
> Ship the pipeline with no checks and the processors as they are now, with an added warning that each model may have some specific input requirements. Progressively improve the processors afterwards. (That's maybe what you're saying? Just want to confirm.)

This is IMO the best.
Not sure why I didn't see this failure on CircleCI, but running on the daily CI runner, I got:
```
FAILED tests/models/fuyu/test_modeling_fuyu.py::FuyuModelTest::test_pipeline_image_text_to_text - TypeError: 'int' object is not subscriptable
```
OH I see. @yonigozlan, better to rebase (or merge, whatever you prefer) the main branch to have #34391.
> Not sure why I didn't see this failure on CircleCI, but running on the daily CI runner, I got:
> `FAILED tests/models/fuyu/test_modeling_fuyu.py::FuyuModelTest::test_pipeline_image_text_to_text - TypeError: 'int' object is not subscriptable`
Thanks! Should be fixed now
@ArthurZucker There were some issues with other multimodal pipeline tests, where model tests weren't actually being run previously. I added the tests and made some modifications to the owlv2 and fuyu image processors, as some backward compatibility wasn't properly handled. This seems to be resolved now, and all tests are passing :)
Thanks for all of your input! I'll merge this now, as the remaining issues/improvements raised seem a bit out of scope for this PR. Just to recap some of the points that were raised:
- VLM processors are not fully consistent in terms of what inputs they accept, and some of them don't catch errors that should be caught. Improvements can be made there that would benefit this pipeline as well. I'll open an issue to share this as a known limitation, and I'll start working on it asap :).
- Donut doesn't work in this pipeline, as processors are not inferred in pipelines when they are not in the auto mappings.
- Chat templates could be applied directly in conversational models' processors instead of users having to do so manually before making a processor call. Chat inputs could be detected since they are lists of dicts.
- Several pipelines have their own way of detecting the input prompt in the generated text and removing or re-adding it. This could be unified in a util, or in `generate` with an added "return_input" flag.
- Most recent models (and VLMs in particular) don't have a "tiny" version uploaded on hf-internal-testing, which means they are not tested by the CI in the different pipelines that support them.