donut
Output doesn't make sense
I tested this image
(To mimic the blurred/mosaic header behavior of the CORD dataset, I cropped the header; this is the resulting image.)
Here is the code I used for inference:
```python
from donut import DonutModel
from PIL import Image
import torch
import os
import json

if __name__ == "__main__":
    ckpt = "donut-base-finetuned-cord-v2"
    task_prompt = "<s_cord-v2>"
    output_dir = "data/no_header_output"
    images = "data/no_header"
    img_extension = ".JPG"

    model = DonutModel.from_pretrained(f"naver-clova-ix/{ckpt}")
    if torch.cuda.is_available():
        # GPU: run in half precision
        model.half()
        device = torch.device("cuda")
        model.to(device)
    else:
        # CPU: only the encoder is cast to bfloat16
        model.encoder.to(torch.bfloat16)
    model.eval()

    os.makedirs(output_dir, exist_ok=True)  # ensure the output folder exists
    for root, dirs, files in os.walk(images):
        for image in files:
            if image.endswith(img_extension):
                input_img = Image.open(os.path.join(root, image))
                output = model.inference(image=input_img, prompt=task_prompt)
                out_path = f"{output_dir}/{ckpt}-{task_prompt}-{os.path.basename(image)}.json"
                with open(out_path, "w", encoding="utf-8") as f:
                    json.dump(output, f, indent=4, sort_keys=True, ensure_ascii=False)
```
The output looks like this:

```json
{
    "predictions": [
        {
            "text_sequence": " 6 9 9 9 9 9 9 9 9 R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R A R R S 26.96 S R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R"
        }
    ]
}
```
This does not make sense at all. I also tested a bunch of other real-world receipts, and the results were rarely accurate.
Is there anything I did wrong?
You might have to fine-tune the model for your specific use case.
Have you tried blurring the top and bottom instead of cropping them? I think that would produce better results.
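For reference, here is a minimal sketch of what blurring the top and bottom bands could look like with Pillow. The function name, band fraction, and blur radius are all illustrative choices, not part of donut or CORD; the idea is just to obscure the header/footer text while keeping the overall receipt geometry the model was trained on:

```python
from PIL import Image, ImageFilter

def blur_top_bottom(img, frac=0.15, radius=12):
    """Blur the top and bottom `frac` of the image instead of cropping,
    so the image keeps its original size and layout."""
    w, h = img.size
    band = int(h * frac)
    out = img.copy()
    for box in [(0, 0, w, band), (0, h - band, w, h)]:
        # Blur only the band, then paste it back in place.
        region = out.crop(box).filter(ImageFilter.GaussianBlur(radius))
        out.paste(region, box)
    return out
```

This could be applied before `model.inference(...)` in place of the cropping step.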
Thanks @TheSeriousProgrammer and @thinh-huynh-re for your replies. It turned out the real issue was #192. After I downgraded both the timm and transformers versions, the results made much more sense. I'd also suggest the development team make adjustments based on that issue.
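In practice, the fix amounts to pinning compatible dependency versions. The exact version numbers below are assumptions for illustration only; check #192 for the combination that actually works with your setup:

```
# requirements.txt fragment — version numbers are illustrative, see #192
timm==0.5.4
transformers==4.25.1
```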