Cannot reproduce Blip2ForImageTextRetrieval example from docs, getting different results
System Info
- `transformers` version: 4.52.4
- Platform: Linux-4.4.0-x86_64-with-glibc2.36
- Python version: 3.12.6
- Huggingface_hub version: 0.32.3
- Safetensors version: 0.5.3
- Accelerate version: not installed
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.7.0+cu126 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: Tesla T4
Who can help?
No response
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I'm trying to run Blip2ForImageTextRetrieval on Modal infrastructure, and it produces very inaccurate results. In short, where I would expect a high "is" score, I get either a very low one or, at best, one close to 0.5.
To debug, I tried to reproduce the example from the docs:
```python
import modal

app = modal.App(name="blip-itm")

image = (
    modal.Image.debian_slim()
    .pip_install("torch", "transformers", "pillow", "requests")
)


@app.function(image=image, gpu="T4")
def official_demo():
    import torch
    from PIL import Image
    import requests
    from transformers import AutoProcessor, Blip2ForImageTextRetrieval

    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = Blip2ForImageTextRetrieval.from_pretrained("Salesforce/blip2-itm-vit-g", torch_dtype=torch.float16)
    processor = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g")
    model.to(device)

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    text = "two cats laying on a pink blanket"
    inputs = processor(images=image, text=text, return_tensors="pt").to(device, torch.float16)

    with torch.cuda.amp.autocast():
        itm_out = model(**inputs, use_image_text_matching_head=True)
    logits_per_image = torch.nn.functional.softmax(itm_out.logits_per_image, dim=1)
    probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities

    print(f"{probs[0][0]:.1%} that image 0 is not '{text}'")
    print(f"{probs[0][1]:.1%} that image 0 is '{text}'")

    texts = ["a photo of a cat", "a photo of a dog"]
    inputs = processor(images=image, text=texts, return_tensors="pt").to(device, torch.float16)

    with torch.cuda.amp.autocast():
        itc_out = model(**inputs, use_image_text_matching_head=False)
    logits_per_image = itc_out.logits_per_image  # this is the image-text similarity score
    probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities

    print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")
    print(f"{probs[0][1]:.1%} that image 0 is '{texts[1]}'")


@app.local_entrypoint()
def main():
    official_demo.remote()
```
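(For completeness: a script like this is launched with Modal's CLI, e.g. `modal run blip_itm.py`; `blip_itm.py` is just an assumed filename for the snippet above.)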
However, the output is:

```
49.1% that image 0 is not 'two cats laying on a pink blanket'
50.9% that image 0 is 'two cats laying on a pink blanket'
49.9% that image 0 is 'a photo of a cat'
50.1% that image 0 is 'a photo of a dog'
```
This is inaccurate and way off from what the docs example states.
Also, I was getting `RuntimeError: expected scalar type Half but found Float`, but I resolved it by explicitly autocasting (that is the only difference between my code and the docs example).
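For reference, the same workaround can also be written with the newer `torch.autocast` API; a minimal sketch, assuming `model` and `inputs` are set up as in the script above:

```python
import torch

# torch.cuda.amp.autocast() still works but is deprecated on recent PyTorch;
# torch.autocast(device_type=..., dtype=...) is the equivalent form.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    itm_out = model(**inputs, use_image_text_matching_head=True)
```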
Expected behavior
The output when running the official docs sample code should be:

```
26.9% that image 0 is not 'two cats laying on a pink blanket'
73.0% that image 0 is 'two cats laying on a pink blanket'
55.3% that image 0 is 'a photo of a cat'
44.7% that image 0 is 'a photo of a dog'
```
This will probably be fixed by https://github.com/huggingface/transformers/pull/38510; there is a dtype mismatch due to some modules being kept in fp32.
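A quick way to see which submodules stayed in fp32 is to list the parameter dtypes. This is a hypothetical diagnostic sketch (not part of the original report), assuming `model` is the `Blip2ForImageTextRetrieval` instance loaded with `torch_dtype=torch.float16` above:

```python
import torch

# Any entries printed here are parameters that were kept in fp32, which is
# what triggers the "expected scalar type Half but found Float" error.
fp32_params = {
    name: param.dtype
    for name, param in model.named_parameters()
    if param.dtype != torch.float16
}
print(fp32_params)
```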
Hi, I applied the fix from #38510, but when I run the Blip2ForImageTextRetrieval example, the output is the same as @KarlisJ's, and I cannot reproduce the docs results.
Which version do you have @azmle112? Can you try installing from `main` so all the fixes are pulled? It worked for me on `main` with the above script.
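For the Modal setup used in the script above, one way to do that is to install `transformers` from the git `main` branch when building the image; a sketch (the exact image wiring is an assumption, only the git-URL install matters):

```python
import modal

# Build the image with transformers from the main branch instead of the
# latest PyPI release, so unreleased fixes are included.
image = (
    modal.Image.debian_slim()
    .pip_install("torch", "pillow", "requests")
    .pip_install("git+https://github.com/huggingface/transformers.git")
)
```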
Thanks for your advice! I have now successfully reproduced it.
I'm getting different results from Blip2ForConditionalGeneration depending on the Transformers library version. Showing the usual photo of two cats sleeping on a couch, I get the following results:
- ✅ 4.51.3: "two cats laying on a couch"
- ❌ 4.52.1: "a"
- ❌ 4.52.2: "a"
- ❌ 4.52.3: "a"
- ❌ 4.52.4: "a woman is standing in front of a pink background"
```python
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import requests
from PIL import Image

device = "cpu"  # change to "cuda" or "mps" if needed
model_id = "Salesforce/blip2-opt-2.7b"

blip2_processor = Blip2Processor.from_pretrained(model_id)
blip2_model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, device_map=device, torch_dtype=torch.float16)

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # two cats
image = Image.open(requests.get(image_url, stream=True).raw)

inputs = blip2_processor(images=image, return_tensors="pt")
inputs = inputs.to(device, dtype=torch.float16)

with torch.no_grad():
    generated_ids = blip2_model.generate(**inputs)

generated_text = blip2_processor.batch_decode(generated_ids,
                                              skip_special_tokens=True)
print(generated_text)
```
Do you think it's the same issue or should I open a new issue?
Here's a gist colab notebook that reproduces the issue.
FYI, I just tried applying #38510, but it didn't fix the issue.
Can you open a new issue @ageron? This one is about the dtype mismatch and was resolved, so I am closing it.