LAVIS
Reproducing InstructBLIP on Flickr30K
Hi,
I'm trying to reproduce the results reported in "InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning", but I'm having difficulty reproducing the InstructBLIP (Vicuna-7B) results on the Flickr30K test set for image captioning.
I'm using the model from Hugging Face and running the code snippet below, and I get a CIDEr score of 60.9, while the reported one is 82.4.
I'm using the prompt reported in the paper, "A short image description: ", and the decoding hyperparameters from the Hugging Face example. I wonder if I'm using the correct hyperparameters and prompt?
PS: with the same hyperparameters, the prompt "A short image caption." increases the CIDEr score to 83.1.
import torch
import tqdm
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "Salesforce/instructblip-vicuna-7b"
processor = InstructBlipProcessor.from_pretrained(model_name)
model = InstructBlipForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch.float16)
model.to(device)
model.eval()

# Prompt from the paper; config and load_dataset are my own Flickr30K helpers.
prompt = ["A short image description: "] * config.batch
transform = lambda img: processor(images=img, text=prompt, return_tensors="pt")
dataset = load_dataset(config=config, transform=transform)

results = []
for batch in tqdm.tqdm(dataset, desc="Inference"):
    img_ids, images, _ = batch
    inputs = images.to(device)
    outputs = model.generate(
        **inputs,
        do_sample=False,
        num_beams=5,
        max_length=256,
        min_length=1,
        top_p=0.9,
        repetition_penalty=1.5,
        length_penalty=1.0,
        temperature=1,
    )
    generated_text = processor.batch_decode(outputs, skip_special_tokens=True)
    # keep (image_id, caption) pairs for the CIDEr evaluation
    results.extend(zip(img_ids, generated_text))
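For reference, here is a rough sketch of how one could compute the CIDEr score from results with pycocoevalcap. The references dict (image id -> list of Flickr30K ground-truth captions) is an assumption not shown above, and the PTB tokenization used by the official COCO evaluation is omitted here.

from pycocoevalcap.cider.cider import Cider

def cider_score(results, references):
    # Cider expects dicts mapping image id -> list of caption strings.
    res = {img_id: [caption.strip().lower()] for img_id, caption in results}
    gts = {img_id: [ref.strip().lower() for ref in references[img_id]] for img_id in res}
    score, _ = Cider().compute_score(gts, res)
    return score

# papers usually report CIDEr scaled by 100
print("CIDEr:", 100 * cider_score(results, references))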
Hello,
I would like to know which file I should run to use InstructBLIP. I can't find the corresponding one; the repository structure seems a bit chaotic to me. I look forward to your response. Thank you!
@sdwulxr I am using the model from Hugging Face precisely because I had many problems with this library and sometimes get confused by its organization. With Hugging Face, on the other hand, it's just plug-and-play.
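By plug-and-play I mean something like the minimal single-image sketch below, adapted from the Hugging Face example (the image path is just a placeholder):

import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b", torch_dtype=torch.float16
).to(device)
model.eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, text="A short image description: ", return_tensors="pt").to(device)
# cast the image tensor to match the fp16 model weights
inputs["pixel_values"] = inputs["pixel_values"].to(torch.float16)

outputs = model.generate(**inputs, do_sample=False, num_beams=5, max_length=256, repetition_penalty=1.5, length_penalty=1.0)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0].strip())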
Hi @gabrielsantosrv, could you link the rest of your script?