How to reproduce the DocVQA test-set score of the 11B Instruct model from the Meta Llama 3.2 report?
I used the following code to generate predictions on the DocVQA test set and submitted them to the leaderboard website (registration required). I cannot reproduce the 11B Instruct score of 88.4 (ANLS, the leaderboard's default metric) reported in the Llama 3.2 report; I only got 77.4 ANLS. Can anyone share insight into whether the prompt below is the right way to generate predictions?
transformers==4.45.1
import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

fine_tuned_path = <PATH TO 11B INSTRUCT>
processor = AutoProcessor.from_pretrained(
    fine_tuned_path,
    trust_remote_code=True,
)
model = MllamaForConditionalGeneration.from_pretrained(
    fine_tuned_path,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# inference_ex is one DocVQA test example (question string + PIL image)
q, image = inference_ex["question"], inference_ex["image"].convert("RGB")
prompts_complete = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": q},
        ],
    }
]
prompt = processor.apply_chat_template(prompts_complete, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
# Greedy decoding; temperature is ignored when do_sample=False, so it is dropped
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=25)
# Keep only the assistant's answer after the last blank line of the decoded text
res_tmp = processor.decode(outputs[0], skip_special_tokens=True).split("\n\n")[-1]
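For sanity-checking predictions locally before submitting, here is a minimal sketch of the ANLS metric (Average Normalized Levenshtein Similarity) that the DocVQA leaderboard reports. The function names and the 0.5 threshold default below are my own illustration of the standard ANLS definition, not code from the leaderboard itself:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(prediction: str, answers: list[str], threshold: float = 0.5) -> float:
    # ANLS takes the best normalized similarity over all ground-truth answers;
    # similarities below the threshold (0.5 in the DocVQA setup) score as 0.
    best = 0.0
    for ans in answers:
        p, a = prediction.strip().lower(), ans.strip().lower()
        dist = levenshtein(p, a)
        sim = 1.0 - dist / max(len(p), len(a), 1)
        best = max(best, sim)
    return best if best >= threshold else 0.0
```

Averaging `anls(...)` over all test questions gives the leaderboard-style score, so a large gap like 88.4 vs. 77.4 can be localized to specific examples before re-submitting.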
I was able to reproduce the DocVQA test numbers using the prompt in eval_details.md together with VLMEvalKit.
Thank you @jamespark3922 for pointing to the right resource. Closing out this task, as it has the right information on how to reproduce the results.