Cannot reproduce Blip2ForImageTextRetrieval example from docs, getting different results
System Info
- `transformers` version: 4.52.4
- Platform: Linux-4.4.0-x86_64-with-glibc2.36
- Python version: 3.12.6
- Huggingface_hub version: 0.32.3
- Safetensors version: 0.5.3
- Accelerate version: not installed
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.7.0+cu126 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: Tesla T4
Who can help?
No response
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I'm trying to run Blip2ForImageTextRetrieval on Modal infrastructure, and it produces very inaccurate results. In short, where I would expect a high "is" score, I get either a very low one or, at best, one close to 0.5.
To debug, I tried to reproduce the example from the docs:
```python
import modal

app = modal.App(name="blip-itm")

image = (
    modal.Image.debian_slim()
    .pip_install("torch", "transformers", "pillow", "requests")
)


@app.function(image=image, gpu="T4")
def official_demo():
    import torch
    from PIL import Image
    import requests
    from transformers import AutoProcessor, Blip2ForImageTextRetrieval

    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = Blip2ForImageTextRetrieval.from_pretrained("Salesforce/blip2-itm-vit-g", torch_dtype=torch.float16)
    processor = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g")
    model.to(device)

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    text = "two cats laying on a pink blanket"
    inputs = processor(images=image, text=text, return_tensors="pt").to(device, torch.float16)

    with torch.cuda.amp.autocast():
        itm_out = model(**inputs, use_image_text_matching_head=True)
    logits_per_image = torch.nn.functional.softmax(itm_out.logits_per_image, dim=1)
    probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities

    print(f"{probs[0][0]:.1%} that image 0 is not '{text}'")
    print(f"{probs[0][1]:.1%} that image 0 is '{text}'")

    texts = ["a photo of a cat", "a photo of a dog"]
    inputs = processor(images=image, text=texts, return_tensors="pt").to(device, torch.float16)

    with torch.cuda.amp.autocast():
        itc_out = model(**inputs, use_image_text_matching_head=False)
    logits_per_image = itc_out.logits_per_image  # this is the image-text similarity score
    probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities

    print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")
    print(f"{probs[0][1]:.1%} that image 0 is '{texts[1]}'")


@app.local_entrypoint()
def main():
    official_demo.remote()
```
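(For completeness: a script like this is launched with Modal's CLI, e.g. `modal run blip_itm.py`; `blip_itm.py` is just an assumed filename for the snippet above.)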
However, the output is:

```
49.1% that image 0 is not 'two cats laying on a pink blanket'
50.9% that image 0 is 'two cats laying on a pink blanket'
49.9% that image 0 is 'a photo of a cat'
50.1% that image 0 is 'a photo of a dog'
```
This is inaccurate and way off from what the docs example states.
Also, I was getting `RuntimeError: expected scalar type Half but found Float`, but I resolved it by explicitly autocasting (that is the only difference between my code and the docs example).
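For reference, the same workaround can also be written with the newer `torch.autocast` API; a minimal sketch, assuming `model` and `inputs` are set up as in the script above:

```python
import torch

# torch.cuda.amp.autocast() still works but is deprecated on recent PyTorch;
# torch.autocast(device_type=..., dtype=...) is the equivalent form.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    itm_out = model(**inputs, use_image_text_matching_head=True)
```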
Expected behavior
The output when running the official docs sample code should be:

```
26.9% that image 0 is not 'two cats laying on a pink blanket'
73.0% that image 0 is 'two cats laying on a pink blanket'
55.3% that image 0 is 'a photo of a cat'
44.7% that image 0 is 'a photo of a dog'
```
This will probably be fixed by https://github.com/huggingface/transformers/pull/38510; there is a dtype mismatch due to some modules being kept in fp32.
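A quick way to see which submodules stayed in fp32 is to list the parameter dtypes. This is a hypothetical diagnostic sketch (not part of the original report), assuming `model` is the `Blip2ForImageTextRetrieval` instance loaded with `torch_dtype=torch.float16` above:

```python
import torch

# Any entries printed here are parameters that were kept in fp32, which is
# what triggers the "expected scalar type Half but found Float" error.
fp32_params = {
    name: param.dtype
    for name, param in model.named_parameters()
    if param.dtype != torch.float16
}
print(fp32_params)
```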
Hi, I applied the fix from #38510, but when I run the Blip2ForImageTextRetrieval example, the output is the same as @KarlisJ's, and I cannot reproduce the docs results.
Which version do you have @azmle112? Can you try installing from `main` so all the fixes are pulled? It worked for me on `main` with the above script.
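For the Modal setup used in the script above, one way to do that is to install `transformers` from the git `main` branch when building the image; a sketch (the exact image wiring is an assumption, only the git-URL install matters):

```python
import modal

# Build the image with transformers from the main branch instead of the
# latest PyPI release, so unreleased fixes are included.
image = (
    modal.Image.debian_slim()
    .pip_install("torch", "pillow", "requests")
    .pip_install("git+https://github.com/huggingface/transformers.git")
)
```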
Thanks for your advice! I have now successfully reproduced it.
I'm getting different results from Blip2ForConditionalGeneration depending on the Transformers library version. Showing the usual photo of two cats sleeping on a couch, I get the following results:
- ✅ 4.51.3: "two cats laying on a couch"
- ❌ 4.52.1: "a"
- ❌ 4.52.2: "a"
- ❌ 4.52.3: "a"
- ❌ 4.52.4: "a woman is standing in front of a pink background"
```python
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import requests
from PIL import Image

device = "cpu"  # change to "cuda" or "mps" if needed
model_id = "Salesforce/blip2-opt-2.7b"

blip2_processor = Blip2Processor.from_pretrained(model_id)
blip2_model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, device_map=device, torch_dtype=torch.float16)

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # two cats
image = Image.open(requests.get(image_url, stream=True).raw)

inputs = blip2_processor(images=image, return_tensors="pt")
inputs = inputs.to(device, dtype=torch.float16)

with torch.no_grad():
    generated_ids = blip2_model.generate(**inputs)

generated_text = blip2_processor.batch_decode(generated_ids,
                                              skip_special_tokens=True)
print(generated_text)
```
Do you think it's the same issue or should I open a new issue?
Here's a gist colab notebook that reproduces the issue.
FYI, I just tried applying #38510, but it didn't fix the issue.
Can you open a new issue @ageron? This one is about the dtype mismatch and was resolved, so I am closing it.