
Evaluation code for VQAv2

Open ys-zong opened this issue 1 year ago • 4 comments

Hi, thanks again for the nice work! I was trying to reproduce the VQAv2 experiments using your pretrained weights, evaluating with this repo mentioned in the paper, but I only get an accuracy of ~10%. I guess there is something wrong with my code, or maybe the prompting makes a difference. Could you push the VQAv2 evaluation code? That would be very helpful. Many thanks!

Here is a snippet of how I generate the answers:

import os

import torch
from tqdm import tqdm

# `utils` is the FROMAGe repo's utility module; `id_to_imgname` maps a VQAv2
# image id to its COCO val2014 filename (both defined elsewhere in my script).
def generate_answers(questions, model, root_path):
    results = []
    for question in tqdm(questions, desc="Generating answers"):
        # Load and preprocess the image for this question.
        img_id = question['image_id']
        img_path = os.path.join(root_path, 'val2014', id_to_imgname(img_id))
        image = utils.get_image_from_path(img_path)
        pixel_values = utils.get_pixel_values_for_model(model.model.feature_extractor, image)
        pixel_values = pixel_values.to(device=model.model.logit_scale.device, dtype=model.model.logit_scale.dtype)
        pixel_values = pixel_values[None, ...]
        # Map the image into the language model's embedding space.
        imginp = model.model.get_visual_embs(pixel_values, mode='captioning')

        # Build the 'Q: ... A:' prompt and embed its tokens.
        question_text = question['question']
        prompt_text = 'Q: ' + question_text + ' A:'
        input_ids = model.model.tokenizer(prompt_text, add_special_tokens=True, return_tensors="pt").input_ids.to(model.model.logit_scale.device)
        input_text_embedding = model.model.input_embeddings(input_ids)

        # Concatenate image and text embeddings and generate greedily.
        input_embs = torch.cat([imginp, input_text_embedding], dim=1)
        generated_ids, _, _ = model(
            input_embs, None, None, generate=True, num_words=15, temperature=0.0, top_p=1.0)
        predicted_answer = model.model.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
        predicted_answer = utils.truncate_caption(predicted_answer).strip()

        question_id = question['question_id']
        results.append({
            "question_id": question_id,
            "answer": predicted_answer
        })
    return results
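
For reference, this is roughly how I drive it (a sketch only; the file paths, the results filename, and the id_to_imgname helper below are placeholders for my local setup):

import json

# Placeholder helper: standard COCO val2014 filename convention.
def id_to_imgname(img_id):
    return f'COCO_val2014_{img_id:012d}.jpg'

# `model` is the FROMAGe model loaded from the pretrained checkpoint.
with open('v2_OpenEnded_mscoco_val2014_questions.json') as f:
    questions = json.load(f)['questions']

results = generate_answers(questions, model, root_path='/path/to/coco')

# The official VQA evaluation script expects a results file of
# [{"question_id": ..., "answer": ...}] entries.
with open('vqa_val2014_fromage_results.json', 'w') as f:
    json.dump(results, f)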

For the prompt, I tried both prompt_text = 'Q: ' + question_text + ' A:' and prompt_text = 'Q: ' + question_text + '\nA:'. The former performs slightly better.

ys-zong avatar Jul 29 '23 15:07 ys-zong

Hi, thanks for pointing this out! I realized this wasn't mentioned in the paper (we'll add it to the next arXiv version), but we do the same as MAGMA and "truncate the model output to the length of the longest ground truth answer". Could you try doing this and make sure to cast the outputs to lowercase with .lower()?

Let me know if that works! I'm traveling right now, but I'll upload the VQA eval code we used when I'm back.

kohjingyu avatar Aug 01 '23 14:08 kohjingyu

Thanks for your reply! Yes, I have cast all the outputs to lowercase.

"truncate the model output to the length of the longest ground truth answer"

Does the "longest ground truth answer" mean the longest answer across all questions, or the longest answer for each question (each question has multiple GT answers)?

ys-zong avatar Aug 01 '23 15:08 ys-zong

It should be the longest answer for each question.
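
Roughly, that post-processing could look something like this (just a sketch: "length" here is measured in whitespace-separated words, and gt_answers are the ground-truth answers annotated for that specific question):

def postprocess_answer(prediction, gt_answers):
    # Lowercase, then truncate the prediction to the word length of the
    # longest ground-truth answer for this particular question.
    prediction = prediction.lower().strip()
    max_len = max(len(ans.split()) for ans in gt_answers)
    return ' '.join(prediction.split()[:max_len])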

kohjingyu avatar Aug 01 '23 15:08 kohjingyu

Great! After truncation I now get an accuracy of 27.5%. Thanks a lot for the help! I'd still like to check your implementation to track down the remaining minor difference (but absolutely no hurry).

ys-zong avatar Aug 01 '23 15:08 ys-zong