How to reproduce the DocVQA test-set score of the 11B Instruct model from the Meta Llama 3.2 report?
I used the following code to generate predictions on the DocVQA test set and submitted them to the leaderboard website (registration required). I cannot reproduce the 11B Instruct score of 88.4 (ANLS, the leaderboard's default metric) reported in the Llama 3.2 report; I only got 77.4 ANLS. Can anyone share insight into whether the prompt below is the right way to generate predictions?
transformers==4.45.1
import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

fine_tuned_path = <PATH TO 11B INSTRUCT>
processor = AutoProcessor.from_pretrained(
    fine_tuned_path,
    trust_remote_code=True,
)
model = MllamaForConditionalGeneration.from_pretrained(
    fine_tuned_path,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# inference_ex is one DocVQA test example (question string + PIL image)
q, image = inference_ex["question"], inference_ex["image"].convert("RGB")
prompts_complete = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": q},
        ],
    }
]
prompt = processor.apply_chat_template(prompts_complete, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
# Greedy decoding; temperature is ignored when do_sample=False, so it is dropped
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=25)
# Keep only the assistant's answer after the last blank line of the decoded text
res_tmp = processor.decode(outputs[0], skip_special_tokens=True).split("\n\n")[-1]
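For sanity-checking predictions locally before submitting, here is a minimal sketch of the ANLS metric (Average Normalized Levenshtein Similarity) that the DocVQA leaderboard reports. The function names and the 0.5 threshold default below are my own illustration of the standard ANLS definition, not code from the leaderboard itself:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(prediction: str, answers: list[str], threshold: float = 0.5) -> float:
    # ANLS takes the best normalized similarity over all ground-truth answers;
    # similarities below the threshold (0.5 in the DocVQA setup) score as 0.
    best = 0.0
    for ans in answers:
        p, a = prediction.strip().lower(), ans.strip().lower()
        dist = levenshtein(p, a)
        sim = 1.0 - dist / max(len(p), len(a), 1)
        best = max(best, sim)
    return best if best >= threshold else 0.0
```

Averaging `anls(...)` over all test questions gives the leaderboard-style score, so a large gap like 88.4 vs. 77.4 can be localized to specific examples before re-submitting.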
I was able to reproduce the DocVQA test numbers using the prompt in eval_details.md together with VLMEvalKit.
Thank you @jamespark3922 for pointing to the right resource. Closing out this task, as it has the right information on how to reproduce the results.