LAVIS
Reproducing InstructBLIP on Flickr30K
Hi,
I'm trying to reproduce the results reported in "InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning", but I'm having difficulty reproducing the InstructBLIP (Vicuna-7B) results on the Flickr30K test set for image captioning.
I'm using the model from Hugging Face and running the code snippet below, and I get a CIDEr score of 60.9, while the reported one is 82.4.
I'm using the prompt reported in the paper, "A short image description: ", and the decoding hyperparameters from the Hugging Face example. I wonder if I'm using the correct hyperparameters and prompt?
PS: with the same hyperparameters, the prompt "A short image caption." increases the CIDEr score to 83.1.
import torch
import tqdm
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "Salesforce/instructblip-vicuna-7b"
processor = InstructBlipProcessor.from_pretrained(model_name)
model = InstructBlipForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch.float16)
model.to(device)
model.eval()

# Prompt from the paper; config and load_dataset are my own Flickr30K helpers.
prompt = ["A short image description: "] * config.batch
transform = lambda img: processor(images=img, text=prompt, return_tensors="pt")
dataset = load_dataset(config=config, transform=transform)

results = []
for batch in tqdm.tqdm(dataset, desc="Inference"):
    img_ids, images, _ = batch
    inputs = images.to(device)
    outputs = model.generate(
        **inputs,
        do_sample=False,
        num_beams=5,
        max_length=256,
        min_length=1,
        top_p=0.9,
        repetition_penalty=1.5,
        length_penalty=1.0,
        temperature=1,
    )
    generated_text = processor.batch_decode(outputs, skip_special_tokens=True)
    # keep (image_id, caption) pairs for the CIDEr evaluation
    results.extend(zip(img_ids, generated_text))
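For reference, here is a rough sketch of how one could compute the CIDEr score from results with pycocoevalcap. The references dict (image id -> list of Flickr30K ground-truth captions) is an assumption not shown above, and the PTB tokenization used by the official COCO evaluation is omitted here.

from pycocoevalcap.cider.cider import Cider

def cider_score(results, references):
    # Cider expects dicts mapping image id -> list of caption strings.
    res = {img_id: [caption.strip().lower()] for img_id, caption in results}
    gts = {img_id: [ref.strip().lower() for ref in references[img_id]] for img_id in res}
    score, _ = Cider().compute_score(gts, res)
    return score

# papers usually report CIDEr scaled by 100
print("CIDEr:", 100 * cider_score(results, references))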
Hello,
I would like to know which file I should run to use InstructBLIP. I can't find the corresponding one; the repository structure seems a bit chaotic to me. I look forward to your response. Thank you!
@sdwulxr I am using the model from Hugging Face precisely because I had many problems with this library and sometimes get confused by its organization. With Hugging Face, on the other hand, it's just plug-and-play.
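By plug-and-play I mean something like the minimal single-image sketch below, adapted from the Hugging Face example (the image path is just a placeholder):

import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b", torch_dtype=torch.float16
).to(device)
model.eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, text="A short image description: ", return_tensors="pt").to(device)
# cast the image tensor to match the fp16 model weights
inputs["pixel_values"] = inputs["pixel_values"].to(torch.float16)

outputs = model.generate(**inputs, do_sample=False, num_beams=5, max_length=256, repetition_penalty=1.5, length_penalty=1.0)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())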
Hi @gabrielsantosrv, could you link the rest of your script?