donut
Output doesn't make sense
I tested this image
(To mimic the blurred/mosaic header behavior of the CORD dataset, I cropped the header; this is the resulting image.)
Here is the code I used for inference:
```python
from donut import DonutModel
from PIL import Image
import torch
import os
import json

if __name__ == "__main__":
    ckpt = "donut-base-finetuned-cord-v2"
    task_prompt = "<s_cord-v2>"
    output_dir = "data/no_header_output"
    images = "data/no_header"
    img_extension = ".JPG"

    model = DonutModel.from_pretrained(f"naver-clova-ix/{ckpt}")
    if torch.cuda.is_available():
        # GPU: run in half precision
        model.half()
        device = torch.device("cuda")
        model.to(device)
    else:
        # CPU: only the encoder is cast to bfloat16
        model.encoder.to(torch.bfloat16)
    model.eval()

    os.makedirs(output_dir, exist_ok=True)  # ensure the output folder exists
    for root, dirs, files in os.walk(images):
        for image in files:
            if image.endswith(img_extension):
                input_img = Image.open(os.path.join(root, image))
                output = model.inference(image=input_img, prompt=task_prompt)
                out_path = f"{output_dir}/{ckpt}-{task_prompt}-{os.path.basename(image)}.json"
                with open(out_path, "w", encoding="utf-8") as f:
                    json.dump(output, f, indent=4, sort_keys=True, ensure_ascii=False)
```
The output looks like this:

```json
{
    "predictions": [
        {
            "text_sequence": " 6 9 9 9 9 9 9 9 9 R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R A R R S 26.96 S R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R R"
        }
    ]
}
```
This does not make sense at all. I also tested a bunch of other real-world receipts, and the results were rarely accurate.
Is there anything I did wrong?
You might have to fine-tune the model for your specific use case.
Have you tried blurring the top and bottom instead of cropping them? I think that would produce better results.
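For reference, here is a minimal sketch of what blurring the top and bottom bands could look like with Pillow. The function name, band fraction, and blur radius are all illustrative choices, not part of donut or CORD; the idea is just to obscure the header/footer text while keeping the overall receipt geometry the model was trained on:

```python
from PIL import Image, ImageFilter

def blur_top_bottom(img, frac=0.15, radius=12):
    """Blur the top and bottom `frac` of the image instead of cropping,
    so the image keeps its original size and layout."""
    w, h = img.size
    band = int(h * frac)
    out = img.copy()
    for box in [(0, 0, w, band), (0, h - band, w, h)]:
        # Blur only the band, then paste it back in place.
        region = out.crop(box).filter(ImageFilter.GaussianBlur(radius))
        out.paste(region, box)
    return out
```

This could be applied before `model.inference(...)` in place of the cropping step.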
Thanks @TheSeriousProgrammer and @thinh-huynh-re for your replies. It turned out the real issue was #192. After I downgraded both the timm and transformers versions, the results made much more sense. I'd also suggest the development team make adjustments based on that issue.
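In practice, the fix amounts to pinning compatible dependency versions. The exact version numbers below are assumptions for illustration only; check #192 for the combination that actually works with your setup:

```
# requirements.txt fragment — version numbers are illustrative, see #192
timm==0.5.4
transformers==4.25.1
```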