transformers Cannot reproduce results for Pix2struct on InfographicVQA

Open Lizw14 opened this issue 1 year ago • 3 comments

I am using the pix2struct-infographics-vqa-base and pix2struct-infographics-vqa-large models and running inference on InfographicsVQA. However, I get 29.53 ANLS for base and 34.31 ANLS for large, which do not match the 38.2 and 40.0 reported in the original paper. Could anyone help with this?

Here is my inference code:

import requests
from PIL import Image
import torch
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-infographics-vqa-base").to("cuda")
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-infographics-vqa-base")

image_url = "https://blogs.constantcontact.com/wp-content/uploads/2019/03/Social-Media-Infographic.png"
image = Image.open(requests.get(image_url, stream=True).raw)
question = "Which social platform has heavy female audience?"
inputs = processor(images=image, text=question, return_tensors="pt").to("cuda")

predictions = model.generate(**inputs)
pred = processor.decode(predictions[0], skip_special_tokens=True)
gt = 'pinterest'

print(pred)
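
The snippet above sets gt but never scores pred against it, and the thread does not show how the ANLS numbers were computed. For reference, here is a minimal sketch of ANLS as it is commonly defined for DocVQA-style benchmarks. The 0.5 threshold and the lowercase/strip normalization are assumptions, the function names are illustrative, and the official evaluation script may differ in details.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions: list of predicted strings, one per question.
    references:  list of lists of accepted ground-truth answers.
    threshold:   assumed cut-off; a normalized distance at or above it scores 0.
    """
    scores = []
    for pred, answers in zip(predictions, references):
        pred_norm = pred.strip().lower()
        best = 0.0
        for answer in answers:
            answer_norm = answer.strip().lower()
            longest = max(len(pred_norm), len(answer_norm))
            distance = levenshtein(pred_norm, answer_norm) / longest if longest else 0.0
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

With the variables from the snippet above, a single example would be scored as anls([pred], [[gt]]). Differences in answer normalization or in how multiple ground-truth answers are handled can shift the final number, so it is worth comparing against the benchmark's own scorer.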

May 30 '23 22:05 Lizw14

cc @younesbelkada

May 31 '23 06:05 NielsRogge

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Jun 30 '23 15:06 github-actions[bot]

gentle ping @younesbelkada

Jun 30 '23 15:06 amyeroberts

Hi everyone, sadly I won't have the bandwidth to properly dig into this right now. @Lizw14, do you still face the same issue when using the main branch of transformers?

pip install git+https://github.com/huggingface/transformers.git

Jul 06 '23 07:07 younesbelkada

@Lizw14 quickly coming back to the issue: can you double-check that you used the same hyperparameters as the ones presented in the paper? For example, what sequence length are you using? In what precision do you load the model (fp32, fp16, bf16, int8)? Ideally, can you share the full script you use to reproduce the results of the paper? Thanks!
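
The two settings asked about above can be made explicit in the inference call. This is a minimal sketch, assuming the max_patches argument of the processor is what controls the input sequence length and torch_dtype in from_pretrained sets the loading precision. The helper name answer_question and the default values are placeholders for illustration, not the paper's settings.

```python
def answer_question(image, question,
                    checkpoint="google/pix2struct-infographics-vqa-base",
                    max_patches=2048,       # placeholder; compare with the paper's sequence length
                    dtype_name="float32",   # e.g. "float32", "float16", "bfloat16"
                    device="cuda",
                    max_new_tokens=32):     # placeholder generation budget
    """Run one VQA query with the sequence length and precision stated explicitly."""
    # Imported here so the helper can be defined without loading torch up front.
    import torch
    from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

    dtype = getattr(torch, dtype_name)
    model = Pix2StructForConditionalGeneration.from_pretrained(
        checkpoint, torch_dtype=dtype
    ).to(device)
    model.eval()
    processor = Pix2StructProcessor.from_pretrained(checkpoint)

    # max_patches bounds how many image patches the encoder receives.
    inputs = processor(images=image, text=question,
                       return_tensors="pt", max_patches=max_patches)
    # Move to the device and cast floating-point tensors to the model's dtype.
    inputs = inputs.to(device, dtype)

    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

Running the same evaluation set with two or three values of max_patches and dtype_name would show whether either setting accounts for the gap in ANLS.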

Jul 07 '23 07:07 younesbelkada

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Aug 01 '23 15:08 github-actions[bot]
