
OCR training/evaluation of InstructBLIP

Open gyhdog99 opened this issue 1 year ago • 1 comment

Dear Maintainers,

I'm currently trying to reproduce the zero-shot results of InstructBLIP. The caption of Table 5 says that, for datasets with OCR tokens, the image query embeddings are simply appended with the phrase “OCR tokens:”. While examining the datasets, I noticed that OCR-VQA was used during instruction tuning. However, I was unable to locate the mentioned OCR tokens in the downloaded dataset. Could you kindly provide some guidance, or point me in the right direction, on how to access or extract these OCR tokens?
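For context, here is roughly how I am appending OCR tokens on my side; the prompt format and field names are my own assumptions, not taken from the LAVIS code:

```python
# Rough sketch of how I currently append OCR tokens to the instruction text.
# The "OCR tokens:" phrasing follows the Table 5 caption; everything else
# (argument names, joining with commas) is my own guess, not the official
# LAVIS implementation.

def build_instruction_with_ocr(question: str, ocr_tokens: list[str]) -> str:
    """Append detected OCR tokens to the question text."""
    prompt = question.strip()
    if ocr_tokens:
        prompt += " OCR tokens: " + ", ".join(ocr_tokens)
    return prompt


# Example:
print(build_instruction_with_ocr("What is written on the sign?", ["STOP", "AHEAD"]))
# -> "What is written on the sign? OCR tokens: STOP, AHEAD"
```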

Additionally, during zero-shot evaluation on TextVQA, I found that instructblip-t5-xl yields better results than those reported in the paper (around 55% accuracy). I suspect this may be due to differences between my evaluation code and yours, so I'd be grateful for some insight into your TextVQA evaluation process. Specifically, I'm interested in how you post-process the predictions and the ground-truth answers, and how you implemented the _report_metrics method for the TextVQA task.
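For reference, my current scoring is a simplified version of the standard soft VQA accuracy; the normalization and the overall shape of _report_metrics below are my own guesses, not the LAVIS implementation:

```python
# Sketch of my current TextVQA scoring (not the actual LAVIS _report_metrics).
# Uses the simplified soft VQA accuracy min(#matching annotators / 3, 1)
# after light normalization of both predictions and ground-truth answers.
import re


def normalize(ans: str) -> str:
    """Lowercase, strip punctuation and articles (a rough stand-in for the
    official VQA answer processing)."""
    ans = ans.lower().strip()
    ans = re.sub(r"[^\w\s]", "", ans)
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    return " ".join(ans.split())


def vqa_accuracy(prediction: str, gt_answers: list[str]) -> float:
    """Soft accuracy for one question given the list of annotator answers."""
    pred = normalize(prediction)
    matches = sum(normalize(g) == pred for g in gt_answers)
    return min(matches / 3.0, 1.0)


def report_metrics(predictions: list[str], gt_answer_lists: list[list[str]]) -> float:
    """Mean soft accuracy over the evaluation set, as a percentage."""
    scores = [vqa_accuracy(p, g) for p, g in zip(predictions, gt_answer_lists)]
    return 100.0 * sum(scores) / max(len(scores), 1)
```

If your post-processing differs from this (e.g. different answer normalization, or the full leave-one-out VQA metric), that could easily explain the gap I'm seeing.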

I am looking forward to your response and would like to thank you in advance for your time and support.

Thanks!

gyhdog99 avatar Jul 18 '23 08:07 gyhdog99

@gyhdog99 Hello, I am also trying to evaluate InstructBLIP on TextVQA, but I have not managed to run the evaluation successfully. Could you share your evaluation script? Thank you sincerely!

Fym68 avatar Mar 05 '24 11:03 Fym68