Add image text to text pipeline
What does this PR do?
Add image-text-to-text pipeline!
A split of this PR containing only the model-specific pre- and post-processing changes is available here, in order to reduce the line count and the number of files changed before merging this PR.
Note: a "legacy" kwarg is needed here to modify the preprocessing of some image-text-to-text models if we want to integrate them into this pipeline. However, the way it is handled might not be ideal, so I'm open to suggestions on how to improve it.
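For context, the current handling amounts to the try/except fallback below. This is a simplified paraphrase of the pipeline's `preprocess` (the same logic is visible in the traceback later in this thread), not the exact code:

```python
def call_processor_with_legacy_fallback(processor, images, text, framework, torch_dtype, **kwargs):
    """Call the processor; drop the "legacy" kwarg for processors that reject it."""
    try:
        return processor(images=images, text=text, return_tensors=framework, **kwargs).to(dtype=torch_dtype)
    except TypeError:
        # Processors that don't accept "legacy" raise a TypeError; retry without it.
        kwargs.pop("legacy", None)
        return processor(images=images, text=text, return_tensors=framework, **kwargs).to(dtype=torch_dtype)
```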
The pipeline supports the following inputs:
- unbatched images and text: `images=image, text=text`
- batched images and text: `images=[image, image], text=[text, text]`
- several images per prompt (only for models supporting the use of an image token): `images=[[image, image], [image]]` or `images=[image, image, image]`, with `text=["... <image>...<image>...", "...<image>..."]`
- chat templates (for models supporting them); see the call sketch after this list
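A minimal sketch of these calling conventions, using the same checkpoint and fixture image as the examples below (the prompts themselves are illustrative):

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
img = "./tests/fixtures/tests_samples/COCO/000000039769.png"

# Unbatched image and text
pipe(images=img, text="<image> What is this?")

# Batched images and text (one image per prompt)
pipe(images=[img, img], text=["<image> What is this?", "<image> Describe this image."])

# Several images per prompt, nested or flat (models with an image token only)
pipe(images=[[img, img], [img]], text=["<image><image> Compare these.", "<image> What is this?"])
pipe(images=[img, img, img], text=["<image><image> Compare these.", "<image> What is this?"])
```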
TODOs:
- [x] Add pipeline tests in model-specific test files
- [ ] Update tasks documentation?
Known current limitations/bugs:
- Using prompts without image tokens with models that expect them will throw an error. Should we automatically add image tokens to prompts and display a warning? For now, only a warning is displayed.
- Using several images per prompt with models that do not support the use of an image token will raise an uncaught error.
- Donut doesn't work, as there is a problem identifying the correct model type for it
- Idefics3 will raise an uncaught error if no correct image tokens are provided
- Pixtral with batched input raises: `Pipeline with tokenizer without pad_token cannot do batching. You can try to set it with pipe.tokenizer.pad_token_id = model.config.eos_token_id.` See the workaround sketch after this list.
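A minimal workaround sketch for that Pixtral batching error, based only on the error message above; the checkpoint name is an assumption:

```python
from transformers import pipeline

# Hypothetical checkpoint name; substitute the Pixtral checkpoint you are testing.
pipe = pipeline("image-text-to-text", model="mistral-community/pixtral-12b")

# Batching needs a pad token; fall back to EOS as the error message suggests.
if pipe.tokenizer.pad_token_id is None:
    pipe.tokenizer.pad_token_id = pipe.model.config.eos_token_id
```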
Examples of usage:
```python
>>> from transformers import pipeline
>>> pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
>>> image = "./tests/fixtures/tests_samples/COCO/000000039769.png"
>>> text = "<image> What this is? Assistant: This is"
>>> pipe(image, text=text, max_new_tokens=20)
[
    [
        {
            "input_text": "<image> What this is? Assistant: This is",
            "generated_text": "<image> What this is? Assistant: This is a photo of two cats lying on a pink blanket. The cats are sleeping and appear to be comfortable",
        }
    ]
]
```
```python
>>> from transformers import pipeline
>>> pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
...             },
...             {"type": "text", "text": "Describe this image."},
...         ],
...     }
... ]
>>> outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
>>> print(outputs[0]["generated_text"])
"In the image, a woman is sitting on the sandy beach, her legs crossed in a relaxed manner"
```
```python
>>> from transformers import pipeline
>>> pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
...             },
...             {"type": "text", "text": "Describe this image."},
...         ],
...     },
...     {
...         "role": "assistant",
...         "content": [
...             {"type": "text", "text": "There is a dog and"},
...         ],
...     },
... ]
>>> outputs = pipe(text=messages, max_new_tokens=20)
>>> print(outputs[0]["generated_text"])
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "There is a dog and a person in the image. The dog is sitting on the sand, and the person is sitting on",
            }
        ],
    },
]
```
Who can review?
@Rocketknight1 @molbap @qubvel @NielsRogge
Will it be possible to use this PR for just text generation with an image-capable model? I'm trying to use this PR (at commit 4ac2d1fce81a00d251ae9af75f32b1f821d56296) with meta-llama/Llama-3.2-90B-Vision-Instruct so that I can compare its language capabilities against Llama 3.1 70B, and I don't need the image support.
I tried calling it like this:
```python
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    device_map="auto",
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is 1+1?"},
        ],
    }
]
outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
print(outputs[0]["generated_text"])
```
That resulted in this error:
```
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:393, in ImageTextToTextPipeline.preprocess(self, inputs, truncation, padding, max_length, timeout, continue_final_message)
392 try:
--> 393 model_inputs = self.processor(images=images, text=text, return_tensors=self.framework, **kwargs).to(
394 dtype=self.torch_dtype
395 )
396 except TypeError:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/processing_mllama.py:285, in MllamaProcessor.__call__(self, images, text, audio, videos, **kwargs)
284 _ = text_kwargs.pop("padding_side", None) # hack until padding-side is an accepted kwarg by tokenizers
--> 285 encoding = self.tokenizer(text, **text_kwargs)
286 data.update(encoding)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3020, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
3019 self._switch_to_input_mode()
-> 3020 encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
3021 if text_target is not None:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3108, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, split_special_tokens, **kwargs)
3107 batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
-> 3108 return self.batch_encode_plus(
3109 batch_text_or_text_pairs=batch_text_or_text_pairs,
3110 add_special_tokens=add_special_tokens,
3111 padding=padding,
3112 truncation=truncation,
3113 max_length=max_length,
3114 stride=stride,
3115 is_split_into_words=is_split_into_words,
3116 pad_to_multiple_of=pad_to_multiple_of,
3117 padding_side=padding_side,
3118 return_tensors=return_tensors,
3119 return_token_type_ids=return_token_type_ids,
3120 return_attention_mask=return_attention_mask,
3121 return_overflowing_tokens=return_overflowing_tokens,
3122 return_special_tokens_mask=return_special_tokens_mask,
3123 return_offsets_mapping=return_offsets_mapping,
3124 return_length=return_length,
3125 verbose=verbose,
3126 split_special_tokens=split_special_tokens,
3127 **kwargs,
3128 )
3129 else:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3310, in PreTrainedTokenizerBase.batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, split_special_tokens, **kwargs)
3301 padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
3302 padding=padding,
3303 truncation=truncation,
(...)
3307 **kwargs,
3308 )
-> 3310 return self._batch_encode_plus(
3311 batch_text_or_text_pairs=batch_text_or_text_pairs,
3312 add_special_tokens=add_special_tokens,
3313 padding_strategy=padding_strategy,
3314 truncation_strategy=truncation_strategy,
3315 max_length=max_length,
3316 stride=stride,
3317 is_split_into_words=is_split_into_words,
3318 pad_to_multiple_of=pad_to_multiple_of,
3319 padding_side=padding_side,
3320 return_tensors=return_tensors,
3321 return_token_type_ids=return_token_type_ids,
3322 return_attention_mask=return_attention_mask,
3323 return_overflowing_tokens=return_overflowing_tokens,
3324 return_special_tokens_mask=return_special_tokens_mask,
3325 return_offsets_mapping=return_offsets_mapping,
3326 return_length=return_length,
3327 verbose=verbose,
3328 split_special_tokens=split_special_tokens,
3329 **kwargs,
3330 )
TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'legacy'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[5], line 9
1 messages = [
2 {
3 "role": "user",
(...)
7 }
8 ]
----> 9 outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
10 print(outputs[0]["generated_text"])
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:291, in ImageTextToTextPipeline.__call__(self, images, text, **kwargs)
285 if isinstance(text, (list, tuple, KeyDataset) if is_torch_available() else (list, tuple)) and isinstance(
286 text[0], (list, tuple, dict)
287 ):
288 # We have one or more prompts in list-of-dicts format, so this is chat mode
290 if isinstance(text[0], dict):
--> 291 return super().__call__(Chat(text, images), **kwargs)
292 else:
293 if images is None:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1302, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
1294 return next(
1295 iter(
1296 self.get_iterator(
(...)
1299 )
1300 )
1301 else:
-> 1302 return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1308, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
1307 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
-> 1308 model_inputs = self.preprocess(inputs, **preprocess_params)
1309 model_outputs = self.forward(model_inputs, **forward_params)
1310 outputs = self.postprocess(model_outputs, **postprocess_params)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:398, in ImageTextToTextPipeline.preprocess(self, inputs, truncation, padding, max_length, timeout, continue_final_message)
396 except TypeError:
397 kwargs.pop("legacy", None)
--> 398 model_inputs = self.processor(images=images, text=text, return_tensors=self.framework, **kwargs).to(
399 dtype=self.torch_dtype
400 )
402 model_inputs["text"] = inputs_text
404 return model_inputs
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/processing_mllama.py:290, in MllamaProcessor.__call__(self, images, text, audio, videos, **kwargs)
288 n_images_in_images = [0]
289 if images is not None:
--> 290 images = make_list_of_images(images)
291 n_images_in_images = [len(sample) for sample in images]
293 if text is not None:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/image_processing_mllama.py:543, in make_list_of_images(images)
541 output_images = images
542 else:
--> 543 raise ValueError(
544 "Invalid input type. Must be a single image, a list of images, or a list of batches of images."
545 )
546 return output_images
ValueError: Invalid input type. Must be a single image, a list of images, or a list of batches of images.
```
I also tried running it just as above but with an image input, and that resulted in an OutOfMemoryError, which is confusing because the model is only 166 GB on disk and I'm running this in a 4x80 GB (i.e. 320 GB) H100 Lambda Labs environment.
```
---------------------------------------------------------------------------
OutOfMemoryError Traceback (most recent call last)
Cell In[6], line 23
1 # messages = [
2 # {
3 # "role": "user",
(...)
9 # outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
10 # print(outputs[0]["generated_text"])
11 messages = [
12 {
13 "role": "user",
(...)
21 }
22 ]
---> 23 outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
24 print(outputs[0]["generated_text"])
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:291, in ImageTextToTextPipeline.__call__(self, images, text, **kwargs)
285 if isinstance(text, (list, tuple, KeyDataset) if is_torch_available() else (list, tuple)) and isinstance(
286 text[0], (list, tuple, dict)
287 ):
288 # We have one or more prompts in list-of-dicts format, so this is chat mode
290 if isinstance(text[0], dict):
--> 291 return super().__call__(Chat(text, images), **kwargs)
292 else:
293 if images is None:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1302, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
1294 return next(
1295 iter(
1296 self.get_iterator(
(...)
1299 )
1300 )
1301 else:
-> 1302 return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1309, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
1307 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
1308 model_inputs = self.preprocess(inputs, **preprocess_params)
-> 1309 model_outputs = self.forward(model_inputs, **forward_params)
1310 outputs = self.postprocess(model_outputs, **postprocess_params)
1311 return outputs
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1209, in Pipeline.forward(self, model_inputs, **forward_params)
1207 with inference_context():
1208 model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
-> 1209 model_outputs = self._forward(model_inputs, **forward_params)
1210 model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
1211 else:
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:412, in ImageTextToTextPipeline._forward(self, model_inputs, generate_kwargs)
408 prompt_text = model_inputs.pop("text")
409 input_ids = (
410 model_inputs["input_ids"] if "input_ids" in model_inputs else model_inputs["decoder_input_ids"]
411 ) # for decoder-only models
--> 412 generated_sequence = self.model.generate(**model_inputs, **generate_kwargs)
414 return {"generated_sequence": generated_sequence, "prompt_text": prompt_text, "input_ids": input_ids}
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
113 @functools.wraps(func)
114 def decorate_context(*args, **kwargs):
115 with ctx_factory():
--> 116 return func(*args, **kwargs)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/generation/utils.py:2208, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
2200 input_ids, model_kwargs = self._expand_inputs_for_generation(
2201 input_ids=input_ids,
2202 expand_size=generation_config.num_return_sequences,
2203 is_encoder_decoder=self.config.is_encoder_decoder,
2204 **model_kwargs,
2205 )
2207 # 12. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 2208 result = self._sample(
2209 input_ids,
2210 logits_processor=prepared_logits_processor,
2211 stopping_criteria=prepared_stopping_criteria,
2212 generation_config=generation_config,
2213 synced_gpus=synced_gpus,
2214 streamer=streamer,
2215 **model_kwargs,
2216 )
2218 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
2219 # 11. prepare beam search scorer
2220 beam_scorer = BeamSearchScorer(
2221 batch_size=batch_size,
2222 num_beams=generation_config.num_beams,
(...)
2227 max_length=generation_config.max_length,
2228 )
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/generation/utils.py:3176, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs)
3173 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {})
3175 # forward pass to get next token
-> 3176 outputs = self(**model_inputs, return_dict=True)
3178 # synced_gpus: don't waste resources running the code we don't need; kwargs must be updated before skipping
3179 model_kwargs = self._update_model_kwargs_for_generation(
3180 outputs,
3181 model_kwargs,
3182 is_encoder_decoder=self.config.is_encoder_decoder,
3183 )
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
168 output = module._old_forward(*args, **kwargs)
169 else:
--> 170 output = module._old_forward(*args, **kwargs)
171 return module._hf_hook.post_forward(module, output)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/modeling_mllama.py:2138, in MllamaForConditionalGeneration.forward(self, input_ids, pixel_values, aspect_ratio_mask, aspect_ratio_ids, attention_mask, cross_attention_mask, cross_attention_states, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep)
2135 cross_attention_mask = cross_attention_mask[:, :, cache_position]
2136 full_text_row_masked_out_mask = full_text_row_masked_out_mask[:, :, cache_position]
-> 2138 outputs = self.language_model(
2139 input_ids=input_ids,
2140 attention_mask=attention_mask,
2141 position_ids=position_ids,
2142 cross_attention_states=cross_attention_states,
2143 cross_attention_mask=cross_attention_mask,
2144 full_text_row_masked_out_mask=full_text_row_masked_out_mask,
2145 past_key_values=past_key_values,
2146 use_cache=use_cache,
2147 inputs_embeds=inputs_embeds,
2148 labels=labels,
2149 output_hidden_states=output_hidden_states,
2150 output_attentions=output_attentions,
2151 return_dict=return_dict,
2152 cache_position=cache_position,
2153 num_logits_to_keep=num_logits_to_keep,
2154 )
2156 return outputs
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/modeling_mllama.py:1948, in MllamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, cross_attention_states, cross_attention_mask, full_text_row_masked_out_mask, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep)
1931 outputs = self.model(
1932 input_ids=input_ids,
1933 cross_attention_states=cross_attention_states,
(...)
1944 cache_position=cache_position,
1945 )
1947 hidden_states = outputs[0]
-> 1948 logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()
1950 loss = None
1951 if labels is not None:
1952 # Upcast to float if we need to compute the loss to avoid potential precision issues
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
1551 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1552 else:
-> 1553 return self._call_impl(*args, **kwargs)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
1557 # If we don't have any hooks, we want to skip the rest of the logic in
1558 # this function, and just call forward.
1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1560 or _global_backward_pre_hooks or _global_backward_hooks
1561 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562 return forward_call(*args, **kwargs)
1564 try:
1565 result = None
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
164 def new_forward(module, *args, **kwargs):
--> 165 args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
166 if module._hf_hook.no_grad:
167 with torch.no_grad():
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:355, in AlignDevicesHook.pre_forward(self, module, *args, **kwargs)
347 if (
348 value is not None
349 and self.tied_params_map is not None
350 and value.data_ptr() in self.tied_params_map
351 and self.execution_device not in self.tied_params_map[value.data_ptr()]
352 ):
353 self.tied_pointers_to_remove.add((value.data_ptr(), self.execution_device))
--> 355 set_module_tensor_to_device(
356 module,
357 name,
358 self.execution_device,
359 value=value,
360 fp16_statistics=fp16_statistics,
361 tied_params_map=self.tied_params_map,
362 )
364 return send_to_device(args, self.execution_device), send_to_device(
365 kwargs, self.execution_device, skip_keys=self.skip_keys
366 )
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/utils/modeling.py:329, in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
327 module._parameters[tensor_name] = param_cls(new_value, requires_grad=old_value.requires_grad)
328 elif isinstance(value, torch.Tensor):
--> 329 new_value = value.to(device)
330 else:
331 new_value = torch.tensor(value, device=device)
OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU 0 has a total capacity of 79.10 GiB of which 2.12 GiB is free. Including non-PyTorch memory, this process has 76.97 GiB memory in use. Of the allocated memory 75.56 GiB is allocated by PyTorch, and 761.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
Thanks for the feedback @knkski! Although it's not really an objective of this pipeline, I think we can try to add support and at least raise a warning, wdyt @Rocketknight1? As for the memory problem, that is strange indeed; I will look into it, and if others have an idea of why this is happening, feel free to chime in. Do you manage to use this model on your setup without using the pipeline?
@yonigozlan I think that's okay! It might result in a bit of crossover with text-generation pipelines, but I think it's fine, and we can deprecate it later and officially move that functionality to text-generation if it's a problem.
@Rocketknight1 @knkski , text-only inference should be supported now :)
@yonigozlan Thanks! Works great for me :rocket:
I think the extra memory usage is unrelated to this PR, so ignore that :+1:
That might mean we should take these changes and fold them into text-generation instead. However, that might add additional inputs that would make it harder to synchronize the pipeline with the inference spec - cc @Wauplin / @LysandreJik, how annoying do you think that would be?
X-posting the slack thread (private) about that convo.
IMO better to have both text-generation and image-text-to-text to be consistent with https://huggingface.co/tasks.
There are still some issues with the pipeline tests:
- It seems that pipeline model tests are based on "tiny models" available on hf-internal-testing, but those tiny models don't seem to be added anymore for recent VLMs, so those models are not being tested. I'm not sure whether this is (or used to be) an automatic or a manual process, and whether we should start adding those tiny models again.
- The Kosmos2 tiny model causes some problems: its configuration has hyper-parameters that are not compatible with each other. Namely, `latent_query_num=3`, which is a model parameter, should be the same as `num_image_tokens=64`, which is a processor call argument and so can't be set via a JSON config file (I think?). An easy fix would be to manually change `latent_query_num` to 64 in the tiny model's config on hf-internal-testing, but that could make the model not so tiny anymore. Or we could skip the test altogether. See the sketch after this list.
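For illustration, the mismatch boils down to something like the sketch below. It is based only on the parameter names mentioned above, and the checkpoint name is hypothetical:

```python
import requests
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

repo = "hf-internal-testing/tiny-random-Kosmos2"  # hypothetical tiny checkpoint
model = AutoModelForVision2Seq.from_pretrained(repo)
processor = AutoProcessor.from_pretrained(repo)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The processor inserts `num_image_tokens` image placeholder tokens into the
# text (a __call__ argument, default 64), while the model produces
# `config.latent_query_num` image embeddings (3 in the tiny config). The two
# counts must agree, so pass the model's value explicitly:
inputs = processor(
    images=image,
    text="<grounding> An image of",
    num_image_tokens=model.config.latent_query_num,
)
```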
@yonigozlan tiny models aren't automatically generated, those are all manually created. Rather than modifying an existing one (which might break existing tests), I'd suggest just making a new tiny model that fits what you want to test and uploading that to hf-internal-testing. You can ask to be added to the organization if you don't have permissions!
> @yonigozlan tiny models aren't automatically generated, those are all manually created. Rather than modifying an existing one (which might break existing tests), I'd suggest just making a new tiny model that fits what you want to test and uploading that to hf-internal-testing. You can ask to be added to the organization if you don't have permissions!
I see, thanks for the explanation! As for adding a new tiny model, pipelines use the `tiny_model_summary.json` file to identify tiny models, but it looks like only one tiny model per model architecture can be present in that file, so I'm not sure how to solve the issue with the Kosmos2 tiny model without modifying the current one.
@yonigozlan probably the easiest thing to do, in that case, is just to manually upload a new model, don't add it to tiny_model_summary, and manually set that model in the image-text-to-text tests. You shouldn't need to worry about whatever's in tiny_model_summary.json either way!
Also, I was wrong - some of the tiny models are automatically created, but in this case I think a manual one just for your pipeline will work a lot better.
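A rough sketch of what pinning such a manually uploaded tiny model in the pipeline tests could look like; every checkpoint name here is hypothetical:

```python
import unittest

from transformers import pipeline
from transformers.testing_utils import require_torch


@require_torch
class Kosmos2TinyPipelineTest(unittest.TestCase):
    def test_image_text_to_text(self):
        # Hypothetical manually uploaded tiny checkpoint, deliberately not
        # listed in tiny_model_summary.json.
        pipe = pipeline(
            "image-text-to-text",
            model="hf-internal-testing/tiny-random-kosmos2-pipeline",
        )
        outputs = pipe(
            images="./tests/fixtures/tests_samples/COCO/000000039769.png",
            text="<grounding> An image of",
            max_new_tokens=5,
        )
        self.assertIsInstance(outputs, list)
```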
@ArthurZucker Addressed your comments! I tried to simplify the code a bit without removing any logic. Happy to remove some logic depending on what you think about this https://github.com/huggingface/transformers/pull/34170#discussion_r1816953666
Answered on the issue. I think we should simplify to ship with no checks, and add checks in the corresponding processors.
Clearly agree that we should rely on the processors. I can turn all this logic into a util or a `ProcessorMixin` function and adapt it a bit for models requiring some specific input formatting. Happy to work on that. I do think, however, that this will take quite a bit of time to validate and to make sure we handle BC correctly for all processors, and it seems there's a need for this pipeline to be shipped quickly (cc @NielsRogge on that). So I feel we can either:
- Ship the pipeline with no checks and the processors as they are now, with an added warning that each model may have some specific input requirements. Progressively improve the processors afterwards. (That's maybe what you're saying? Just want to confirm.)
- Keep the checks in the pipeline for now and remove them once the processors are in a good state.
- Modify the processors first and delay shipping this pipeline.
@ArthurZucker I removed most of the preprocessing logic, hopefully the code should be less messy now :)
> Ship the pipeline with no checks and the processors as they are now, with an added warning that each model may have some specific input requirements. Progressively improve the processors afterwards. (That's maybe what you're saying? Just want to confirm.)

This is IMO the best.
Not sure why I didn't see this failure on CircleCI, but running on the daily CI runner, I got:
```
FAILED tests/models/fuyu/test_modeling_fuyu.py::FuyuModelTest::test_pipeline_image_text_to_text - TypeError: 'int' object is not subscriptable
```
OH I see. @yonigozlan, better to rebase (or merge, whatever you prefer) the main branch to have #34391.
> Not sure why I didn't see this failure on CircleCI, but running on the daily CI runner, I got:
> `FAILED tests/models/fuyu/test_modeling_fuyu.py::FuyuModelTest::test_pipeline_image_text_to_text - TypeError: 'int' object is not subscriptable`
Thanks! Should be fixed now
@ArthurZucker There were some issues with other multimodal pipeline tests, where model tests weren't actually being run previously. I added the tests and made some modifications to the owlv2 and fuyu image processors, as some backward compatibility wasn't properly handled. This seems to be resolved now, and all tests are passing :)
Thanks for all of your input! I'll merge this now, as the remaining issues/improvements raised seem a bit out of scope for this PR. Just to recap some of the points that were raised:
- VLM processors are not fully consistent in terms of what inputs they accept, and some of them don't catch errors that should be caught. Improvements can be made there that would benefit this pipeline as well. I'll open an issue to share this as a known limitation, and I'll start working on it asap :).
- Donut doesn't work in this pipeline, as processors are not inferred in pipelines when they are not in the auto mappings.
- Chat templates could be applied directly in conversational models' processors instead of users having to do so manually before making a processor call. Chat inputs could be detected since they are lists of dicts.
- Several pipelines have their own way of detecting the input prompt in the generated text and removing or re-adding it. This could be unified in a util, or in `generate` with an added "return_input" flag.
- Most recent models (and VLMs in particular) don't have a "tiny" version uploaded on hf-internal-testing, which means they are not tested by the CI in the different pipelines that support them.