
[BLIP-2] BitsAndBytes 4 and 8 bit give empty string

Open NielsRogge opened this issue 1 year ago • 10 comments

System Info

Transformers v4.40.dev

Who can help?

@younesbelkada

Reproduction

As reported here: https://huggingface.co/Salesforce/blip2-opt-2.7b/discussions/26, the 4 and 8 bit versions of BLIP-2 return an empty string (or only special tokens) when decoding.

Here's how to reproduce:

import torch

from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_4bit=True, device_map="auto")

raw_image = Image.open("01256.png").convert('RGB')
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=False).strip())

Expected behavior

Should return an answer similar to full/half precision

NielsRogge avatar Apr 22 '24 08:04 NielsRogge

Hi @NielsRogge, running:

import requests
import torch

from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_4bit=True, device_map={"": 0})

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=False).strip())

Gives me correctly:

</s>a woman sitting on the beach with a dog

On transformers main + latest bitsandbytes, can you try running that script and let me know what you get?

younesbelkada avatar Apr 22 '24 16:04 younesbelkada

Hello @younesbelkada, I ran into a similar problem. Your code produces the same text for me, but when I try VQA it again outputs nothing. I've only added one question to the processor, as below.

import requests
import torch

from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

cache_dir = "/p/yufeng/.cache"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b", cache_dir=cache_dir)
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", cache_dir=cache_dir,
                                                      load_in_4bit=True, device_map={"": 0})

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "What is in the picture?"

inputs = processor(raw_image, text=question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=False).strip())

Could you take a look at this problem? I truly appreciate it.

chrisgao99 avatar Apr 22 '24 18:04 chrisgao99

Yes, sorry, I linked the wrong code snippet; you get an empty response when passing text along with the image:

# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())

NielsRogge avatar Apr 22 '24 18:04 NielsRogge

And if I print the output,

out = model.generate(**inputs)
print(out)

I always get the same tokens

tensor([[    2, 50118]], device='cuda:0')

no matter which image or text I pass in, so I suspect the model isn't processing the inputs correctly.
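
For reference, decoding those two ids by hand shows why the stripped string comes out empty; a minimal sketch, assuming the processor from the snippets above is still in scope:

import torch

ids = torch.tensor([[2, 50118]])
# Token 2 is OPT's </s> (eos) token, and 50118 appears to decode to a newline,
# so after skip_special_tokens + .strip() nothing is left.
print(processor.batch_decode(ids, skip_special_tokens=False))  # expected: ['</s>\n']
print(processor.batch_decode(ids, skip_special_tokens=True))   # expected: ['\n'] -> '' after .strip()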

chrisgao99 avatar Apr 22 '24 21:04 chrisgao99

Hi @NielsRogge @chrisgao99 I just ran:

import requests
import torch

from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map={"": 0})

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=False).strip())

and got:

</s>a woman sitting on the beach with a dog

This is with the latest bitsandbytes, on an NVIDIA A100 GPU.

younesbelkada avatar Apr 23 '24 09:04 younesbelkada

@younesbelkada yes, that's because you're not passing any text to the processor, so no text reaches the model. The bug only happens when passing an image + text.

NielsRogge avatar Apr 23 '24 19:04 NielsRogge

Hi all,

I can reproduce this too, but if I make some changes to the prompt text or the generation strategy, I do start to get other results:

- Default: question = "how many dogs are in the picture?" -> output: [2, 50118] = ""
- Prompt change: question = "how many dogs are in the picture? answer:" -> output: [2, 112, 50118] = " 1"
- min_length=10: question = "how many dogs are in the picture?" -> output: [2, 111, 2335, 1058, 50118] = " - dog training"
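
A minimal sketch of how these three cases can be tried, assuming the quantized model, processor and demo image from the earlier snippets are already loaded (the commented outputs are the ones listed above):

# Assumes `model`, `processor` and `raw_image` from the earlier snippets.
for prompt in ["how many dogs are in the picture?",           # default -> ""
               "how many dogs are in the picture? answer:"]:  # prompt change -> " 1"
    inputs = processor(raw_image, prompt, return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs)
    print(out, processor.decode(out[0], skip_special_tokens=True).strip())

# Forcing a longer generation also changes the result (-> " - dog training"):
inputs = processor(raw_image, "how many dogs are in the picture?", return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, min_length=10)
print(out, processor.decode(out[0], skip_special_tokens=True).strip())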

matthewdouglas avatar Apr 25 '24 17:04 matthewdouglas

Hi all, I think the prompt template for VQA is as below:

question = "Question: how many dogs are in the picture? Answer:"

See the current documentation here.
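
A minimal sketch of that template plugged into the earlier VQA snippet, assuming the quantized model, processor and demo image are already loaded:

# Assumes `model`, `processor` and `raw_image` from the earlier snippets.
question = "Question: how many dogs are in the picture? Answer:"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())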

ecekt avatar Apr 29 '24 21:04 ecekt

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 24 '24 08:05 github-actions[bot]

Gently pinging @younesbelkada here

NielsRogge avatar May 24 '24 08:05 NielsRogge

> Hi all, I think the prompt template for VQA is as below:
>
> question = "Question: how many dogs are in the picture? Answer:"
>
> See the current documentation here.

@NielsRogge @younesbelkada @ecekt I also got an empty response using fp16 at first. Using the template, it works fine. The model also seems to be sensitive to capitalization in the template:

import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map=0)

# Image loading was not shown in the original snippet; assuming the demo image used earlier in this thread.
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

def ask_question(prompt):
    inputs = processor(raw_image, prompt, return_tensors="pt").to("cuda:0", torch.float16)
    out = model.generate(**inputs, max_new_tokens=100)
    return processor.decode(out[0], skip_special_tokens=True).strip()

print(ask_question("Is there a woman in this picture?"))
# -> ''
print(ask_question("Question: Is there a woman in this picture? Answer:"))
# -> 'Yes, there is a woman in this picture.'
print(ask_question("Question: Is there a woman in this picture? answer:"))  # lowercase "answer"
# -> 'no, there is no woman in this picture'

tlpss avatar May 27 '24 12:05 tlpss

Hi everyone, as pointed out by @tlpss & @ecekt, I don't think there is an issue here. I didn't flag any regression between transformers versions; I was able to reproduce the empty-string behaviour across transformers == 4.30.0 and 4.41.0. Make sure to follow the correct VQA format when prompting BLIP-2 for visual question answering.
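
For completeness, a self-contained sketch of a 4-bit setup with the documented VQA template. Using BitsAndBytesConfig here is my own choice (the snippets above pass load_in_4bit/load_in_8bit directly); both routes should behave the same:

import torch
import requests

from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", quantization_config=quant_config, device_map={"": 0}
)

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# "Question: ... Answer:" is the prompt format the BLIP-2 docs use for VQA.
prompt = "Question: how many dogs are in the picture? Answer:"
inputs = processor(raw_image, prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True).strip())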

younesbelkada avatar May 27 '24 13:05 younesbelkada

Closing the issue!

younesbelkada avatar May 27 '24 13:05 younesbelkada