Add bounding box coordinates to predictions
It would be useful to get bounding box coordinates from the Document Information Extraction task predictions.
With a conventional pipeline:

With Donut it could be something like:
{
  'predictions': [{
    'menu': [{
        'cnt': '2',
        'nm': 'ICE BLAOKCOFFE',
        'price': '82,000',
        'bbox': [xmin, ymin, xmax, ymax]
      },
      {
        'cnt': '1',
        'nm': 'AVOCADO COFFEE',
        'price': '61,000',
        'bbox': [xmin, ymin, xmax, ymax]
      },
    ],
    'total': {
      'cashprice': '200,000',
      'changeprice': '25,400',
      'total_price': '174,600',
      'bbox': [xmin, ymin, xmax, ymax]
    }
  }]
}
A possible solution (which I did not manage to get working): https://github.com/clovaai/donut/issues/16#issuecomment-1217464215
I come from #16 and I think the approach mentioned there could be a huge step towards resolving this issue, but sadly that issue has been closed. @gwkrsrch, is there any chance you could share the code used to transform the outputs from https://github.com/clovaai/donut/blob/1.0.5/donut/model.py#L492 into the heatmaps shown in Figure 8 of the paper? That would be incredibly helpful.
Thanks! :)

So, I've found a way to generate the heatmaps from the decoder's cross-attentions. However, the attention maps correspond to each output token from the decoder and not necessarily to a word, i.e. the word "Restaurant" might consist of three tokens (Res + tau + rant), and the attention heatmaps are very coarse, so they might not give boxes as precise as in the example.
Additionally, you need to get the correspondence between the token values and the token indices, which means digging into the transformers library's BART batch-decode implementation.
In the example above, I fused the attention heads, the decoder layers, and the different token heatmaps with max fusion, then thresholded the attention areas, contoured them, and kept the bounding box with the largest area (a rough sketch of this procedure follows the notebook link below). Maybe someone can find a way to generate better heatmaps.
I'll attach the link to the notebook I used to generate the maps. If people are interested in the code to get the token-index-to-token-value mapping, I can attach a modified donut/model.py as well.
https://colab.research.google.com/drive/1OzRapy23W8Ksf0AtqlkLFaVAAjJRUqbk?usp=sharing
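For reference, here is a minimal sketch of the fusion-and-thresholding step described above. This is not the exact notebook code: decoder_cross_attentions and token_indexes are obtained as discussed further down in this thread, OpenCV is assumed for the contouring, and feat_hw must match the encoder's final feature map, e.g. (40, 30) for a 1280x960 CORD-style input or (80, 60) for the 2560x1920 DocVQA model.

import cv2
import numpy as np
import torch

# decoder_cross_attentions: tuple over generated tokens; each element is a
# tuple over decoder layers of tensors shaped (batch, num_heads, q_len, enc_len).
# token_indexes: positions of the generated tokens that form the target field
# (e.g. the three sub-word tokens of "Restaurant").
def attention_bbox(decoder_cross_attentions, token_indexes, image_size,
                   feat_hw=(40, 30), threshold=0.5):
    feat_h, feat_w = feat_hw
    heatmap = None
    for t in token_indexes:
        # Max-fuse the decoder layers and attention heads for this token.
        per_layer = torch.stack([layer[0, :, -1, :] for layer in decoder_cross_attentions[t]])
        fused = per_layer.max(dim=0).values.max(dim=0).values          # (enc_len,)
        fused = fused.reshape(feat_h, feat_w).float().cpu().numpy()
        heatmap = fused if heatmap is None else np.maximum(heatmap, fused)

    # Upsample to the original image size (image_size is PIL's (width, height)).
    # Note: this ignores the resize/padding applied by the model's prepare_input,
    # so boxes on the original image may be slightly offset.
    heatmap = cv2.resize(heatmap, image_size)
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
    mask = (heatmap >= threshold).astype(np.uint8)

    # Contour the thresholded attention areas and keep the largest-area box.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return [x, y, x + w, y + h]   # [xmin, ymin, xmax, ymax]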
@SamSamhuns thanks for sharing the colab.
Can we draw boxes for <s_docvqa> as well?
It should work for DocVQA as well. I've added an example in the notebook I shared above, based on the code snippet below. The attention is focused on the answer in the image, but the bounding boxes derived from the attention maps are not very precise.
from PIL import Image

# assumes `model` is a DonutModel loaded as elsewhere in this thread
question = "What is the total?"
img_path = "image.jpg"
task_prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
image = Image.open(img_path).convert("RGB")
output = model.inference(image=image, prompt=task_prompt, return_attentions=True)
# output["predictions"] = [{'question': 'What is the total?', 'answer': '96,000'}]
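To pick out which generated tokens make up the answer (the token-index-to-token-value correspondence mentioned above), a rough sketch like the following can work, assuming inference is modified to also return the generated ids (the stock donut inference does not; output["sequences"] below is a hypothetical key, and model.decoder.tokenizer is the tokenizer attached to Donut's BART decoder).

tokenizer = model.decoder.tokenizer
generated_ids = output["sequences"][0].tolist()    # hypothetical: ids returned alongside the attentions
answer = output["predictions"][0]["answer"]        # e.g. '96,000'
answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]

# Find the answer sub-sequence inside the generated sequence; its positions are
# the token_indexes whose cross-attention maps get fused into a heatmap.
# Caveats: SentencePiece may tokenize the answer slightly differently in
# isolation than in context, and depending on how the attentions are aligned
# with the sequence, the prompt length may need to be subtracted.
token_indexes = []
for start in range(len(generated_ids) - len(answer_ids) + 1):
    if generated_ids[start:start + len(answer_ids)] == answer_ids:
        token_indexes = list(range(start, start + len(answer_ids)))
        break
print(token_indexes)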

I am getting a shape error with the naver-clova-ix/donut-base-finetuned-docvqa model. When I run:
import torch
from PIL import Image
from donut import DonutModel
from pprint import pprint as pp

model = DonutModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")
if torch.cuda.is_available():
    model.half()
    device = torch.device("cuda")
    model.to(device)
else:
    model.encoder.to(torch.bfloat16)
model.eval()

question = "What is the total?"
img_path = "image.jpg"
task_prompt = f"<s_docvqa><s_question>{question}</s_question><s_answer>"
image = Image.open(img_path).convert("RGB")
output = model.inference(image=image, prompt=task_prompt, return_attentions=True)
followed by the rest of the bounding-box code, I get this shape error:
RuntimeError: shape '[4, 16, 40, 30]' is invalid for input of size 307200
Refer to the Document VQA Example section of the notebook below. You have to use a reshape size of [4, 16, 80, 60] for the DocVQA task, since the final cross-attention feature map size differs from the document-extraction task.
https://colab.research.google.com/drive/1OzRapy23W8Ksf0AtqlkLFaVAAjJRUqbk?usp=sharing
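As a quick sanity check on those numbers (a sketch, assuming the standard Donut configs, where the Swin encoder downsamples each spatial dimension by a factor of 32):

# The cross-attention length equals (H / 32) * (W / 32) of the encoder input.
input_h, input_w = model.config.input_size     # [2560, 1920] for the docvqa model
feat_h, feat_w = input_h // 32, input_w // 32  # -> 80, 60 (a 1280x960 CORD model gives 40, 30)

num_layers, num_heads = 4, 16                  # donut-base decoder
print([num_layers, num_heads, feat_h, feat_w]) # reshape target for the stacked cross-attentions
# 4 * 16 * 80 * 60 == 307200, which matches the tensor size in the error above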
@SamSamhuns thanks for sharing this notebook, it's working as expected. But when I give it different images, for example:
image size (919, 998)
output["predictions"] = [{'question': 'What is the Name?', 'answer': 'nice client'}]
the heatmap ends up on the address rather than on the name.
I tried changing the image resolution in the colab with image = image.resize((960, 1280)).
This image is without resizing:
With the resized image:
Maybe you can add this resize code to the colab.
My question is: how can we get multiple answers? For example, in the above image there are two names, From Name and To Name.
Expected output: [{'question': 'What is the Name?', 'answer': 'nice client', 'answer': 'Terlici Ltd.'}] or something similar
Hello @SamSamhuns, can I use your code to get multiple answers as mentioned above (From Name, To Name)?
Expected output: [{'question': 'What is the Name?', 'answer': 'nice client', 'answer': 'Terlici Ltd.'}]
I tried the same on one of my use cases.
input image:

some part of my JSON:

So I entered token_indexes corresponding to date_of_loss:
decoder_cross_attentions = output["attentions"]["cross_attentions"]
token_indexes = [10, 11, 12, 13, 14, 15, 16, 17]
Using the code provided in the colab I got the following:

So even though I am getting the correct output in the JSON, the position of the heatmap doesn't match the position of the date in the image.
The same thing happens with other keys as well, e.g. cause of loss:

So is there anything I am missing?
Maybe try some of the other saliency and feature-map visualization methods from this repo: https://github.com/jacobgil/pytorch-grad-cam
@SamSamhuns How do you use grad-cam to get Donut bounding boxes, i.e. what is the target layer? The cross-attentions?
@SamSamhuns Where can I download model_custom.py?
@ariefwijaya you can find it in my fork of the repo https://github.com/SamSamhuns/donut
@SamSamhuns I just found and tried model_custom.py, but I got this error; is there something I missed?
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1105, in _init_weights
raise NotImplementedError(f"Make sure `_init_weights` is implemented for {self.__class__}")
NotImplementedError: Make sure `_init_weights` is implemented for <class 'donut.donut.model_custom.DonutModel'>
I'm using timm==0.5.4 and transformers==4.25.1.
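Not an authoritative fix, but one workaround sometimes used for this transformers error is to give the custom model class its own _init_weights, since newer transformers versions raise NotImplementedError from the base PreTrainedModel during from_pretrained if it is missing. A minimal sketch (the import path is an assumption, adjust it to the fork's layout, and the checkpoint name is just an example):

import torch.nn as nn
from donut.model_custom import DonutModel  # assumed import path into the fork

def _init_weights(self, module):
    """Plain initialization for any sub-module not restored from the checkpoint."""
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=0.02)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.LayerNorm):
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)

# Monkey-patch before from_pretrained so the fast-init path finds an
# _init_weights implementation instead of raising NotImplementedError.
DonutModel._init_weights = _init_weights

model = DonutModel.from_pretrained("naver-clova-ix/donut-base-finetuned-docvqa")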
I got the same issue: the heatmap position doesn't match the location of the extracted value in the image. Did you find any solution for that? It would be really helpful.
https://github.com/Unstructured-IO/unstructured-inference/blob/main/unstructured_inference/models/chipper.py#L415
See the get_bounding_box function
Has anyone considered or tried training the Donut model with bounding box information included in the JSON? Is there any reason this approach could not work given the model's architecture? If I'm not wrong, the Swin transformer used by Donut for the vision-encoding half is capable of object detection and so should be encoding location information as well.
This person (link below) has attempted just that; very interesting, however the results were not satisfactory: https://github.com/ivelin/donut_ui_refexp?tab=readme-ov-file
@y-gs I would recommend this work (https://openreview.net/pdf?id=3wReeptY6X). This work was done a year ago and will appear in NAACL 2024. The model will be released by May 1st.
@SamSamhuns Thank you for sharing your code with us. It really is a great way to visualize the cross attention.
This may be a very beginner-level issue, but I am getting this error when I try to load a custom fine-tuned and locally saved model checkpoint:
NotImplementedError: Make sure '_init_weights' is implemented for <class 'donut.model_custom.DonutModel'>
I tried the solutions proposed in issue #184 but they didn't work for me. If you have any ideas on how to resolve this issue, it would be a big help.
My code is as follows:
import torch
from PIL import Image
from donut import DonutModelCustom  # removed DonutModel as it is not used
from pprint import pprint

model = DonutModelCustom.from_pretrained("local-directory")
if torch.cuda.is_available():
    model.half()
    device = torch.device("cuda")
    model.to(device)
model.eval()
Hello, it has been a long time since I worked with the custom fork of the Donut model (if that is what you are using), and many things might be out of sync. Unfortunately, due to a lack of time on my part, I will be unable to help any time soon. I hope you resolve the issue.
@kirby707 Hey, have you found a solution yet?
@Flashness123 Hi, I think this repo https://github.com/Veason-silverbullet/ViTLP could help you.
@Veason-silverbullet Thank you, but I have already developed a large fine-tuned Donut model for extracting information from custom documents. Unfortunately, the VisionEncoderDecoder is not good at outputting numbers. Does anyone know why? Is it because of the OCR-free handshake between the encoder and decoder? Now I want to use a heatmap to locate these detected but wrongly read numbers and pass them separately to an OCR.
@SamSamhuns Thank you for your reply. I understand that the custom fork was created quite a while ago. Appreciate your work on the concept.
@Flashness123 I still haven't found a solution, but I'm starting to suspect that it's failing because my fine-tuned model is based on the Hugging Face implementation (VisionEncoderDecoderModel from the Transformers library). I wonder if it would work if the fine-tuned model were based on the implementation by the Donut developer team.
May I ask whether your fine-tuned model is also based on the Hugging Face implementation?