
How to get the bounding boxes of the extracted entities?

WeiquanWa opened this issue 3 years ago

It would be great if donut had the ability to extract the bounding box of each extracted entity. Bounding box information is important and useful for visualization and downstream tasks.

WeiquanWa avatar Aug 09 '22 10:08 WeiquanWa

As far as I know, there aren't actually any bounding boxes. The image is encoded into features, but not into actual boxes.

If you need boxes, you are better off using traditional OCR + modelling (layoutlmv2/3 are great options for this approach)
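A minimal sketch of that OCR + layout-model route, assuming the Hugging Face transformers implementation of LayoutLMv3 with its default built-in Tesseract OCR; the checkpoint, image path, and label count below are placeholders, not something prescribed in this thread:

```python
# Sketch of the OCR + layout-model alternative: the LayoutLMv3 processor runs
# Tesseract OCR by default (apply_ocr=True), so word-level bounding boxes come
# back in the encoding alongside the token ids.
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base")  # requires Tesseract installed
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=5  # num_labels depends on your entity schema
)

image = Image.open("document.png").convert("RGB")   # placeholder path
encoding = processor(image, return_tensors="pt")    # input_ids, attention_mask, bbox, pixel_values

outputs = model(**encoding)
labels = outputs.logits.argmax(-1).squeeze(0)       # one predicted label per token
boxes = encoding["bbox"].squeeze(0)                 # per-token boxes on a 0-1000 normalized scale
for label_id, box in zip(labels.tolist(), boxes.tolist()):
    print(label_id, box)                            # box is [x0, y0, x1, y1]
```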

logan-markewich avatar Aug 14 '22 18:08 logan-markewich

Hi, thanks to @logan-markewich for the helpful comment :)

donut does not require any bounding box annotation/supervision during model training, but, as a result, there are no actual boxes in the model output. Instead, you can get an attention heatmap that could be used for your purpose; see also Figure 8 of https://arxiv.org/abs/2111.15664. The related code line is at:

  • https://github.com/clovaai/donut/blob/1.0.5/donut/model.py#L492
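A rough sketch of one way to obtain those cross-attentions, using the Hugging Face port of Donut rather than the code linked above (the CORD-v2 checkpoint and task prompt are only examples, and the exact generate arguments should be checked against your transformers version):

```python
# Sketch: ask generate() to return the decoder-to-encoder cross-attentions
# alongside the generated token ids.
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"    # example checkpoint
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)
model.eval()

image = Image.open("receipt.png").convert("RGB")        # placeholder path
pixel_values = processor(image, return_tensors="pt").pixel_values

task_prompt = "<s_cord-v2>"                             # prompt for this checkpoint
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    decoder_output = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.decoder.config.max_position_embeddings,
        return_dict_in_generate=True,
        output_attentions=True,   # populates decoder_output.cross_attentions
    )

# cross_attentions: one tuple per generated token, each holding one tensor per
# decoder layer of shape (batch, num_heads, query_len, encoder_seq_len)
print(len(decoder_output.cross_attentions), decoder_output.sequences.shape)
```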

You may convert the heatmap to bounding boxes. The following link, and the sketch below it, might be useful to you:

  • https://stackoverflow.com/a/58421765
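Along the lines of that answer, a minimal sketch, assuming a heatmap already resized to the image resolution: threshold it, find connected regions with OpenCV, and take their bounding rectangles. The threshold value is something you would need to tune.

```python
# Sketch: threshold a per-token heatmap and extract bounding rectangles of the
# connected high-attention regions.
import cv2
import numpy as np

def heatmap_to_boxes(heatmap: np.ndarray, threshold: float = 0.5):
    """heatmap: 2D float array already resized to the original image size."""
    norm = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
    mask = (norm >= threshold).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]   # each box is (x, y, w, h)
```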

Hope this helps. Please let me know if you are still confused.

gwkrsrch avatar Aug 17 '22 05:08 gwkrsrch

@WeiquanWa , did you manage to get some semblance of bounding boxes or the cross-attention heatmap from the outputs?

I cannot interpret the structure of the output attention maps from "cross_attentions": decoder_output.cross_attentions.

I see it is a tuple of tuples, with the outer length equal to the number of tokens (len(decoder_output.sequences)), but there are 4 sub-tuples inside, each of shape torch.Size([1, 16, 1, 1200]). I'm not sure how to get representative heatmaps from these tensors.
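For what it's worth, a hedged sketch of one way to read that structure: treat the 4 sub-tuples as decoder layers and the 1200 encoder positions as the flattened Swin patch grid (e.g. 40 x 30 for a 1280 x 960 input; the exact grid depends on the configured input_size), then average over layers and heads to get one weight per patch for each generated token.

```python
# Sketch: collapse layers and heads into one attention weight per encoder patch
# for every generated token, then reshape to the (assumed) encoder patch grid.
import torch

def token_heatmaps(cross_attentions, grid_hw=(40, 30)):
    """cross_attentions: tuple (per generated token) of tuples (per decoder layer)
    of tensors shaped (batch, num_heads, query_len, encoder_seq_len)."""
    heatmaps = []
    for step in cross_attentions:                    # one entry per generated token
        layers = torch.stack(step, dim=0)            # (layers, batch, heads, q, enc_seq)
        weights = layers.mean(dim=(0, 2))            # avg over layers and heads -> (batch, q, enc_seq)
        weights = weights[0, -1]                     # batch 0, last query position -> (enc_seq,)
        heatmaps.append(weights.reshape(grid_hw))    # e.g. a (40, 30) patch grid
    return heatmaps
```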

SamSamhuns avatar Sep 01 '22 13:09 SamSamhuns

I second this question. I assume these attention maps have to be translated back to the input space (before encoding) to match the actual image shape, but I can't seem to figure out how to do this.
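In case it helps, one way to do that translation, sketched under the assumption that the heatmap has already been reshaped to the encoder's patch grid and that the grid/input sizes below (40 x 30 and 1280 x 960) match the configured input_size: bilinearly upsample the grid back to the resized model input. Any resizing or padding applied during preprocessing would still have to be undone to reach the original document resolution.

```python
# Sketch: upsample a patch-grid heatmap back to the (resized) model input so it
# can be overlaid on the document image.
import torch
import torch.nn.functional as F

def upsample_heatmap(heatmap: torch.Tensor, image_hw=(1280, 960)):
    """heatmap: (grid_h, grid_w) attention weights for one generated token."""
    grid = heatmap[None, None].float()                    # (1, 1, grid_h, grid_w)
    full = F.interpolate(grid, size=image_hw, mode="bilinear", align_corners=False)
    return full[0, 0]                                     # (image_h, image_w)
```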

As suggested in #31, I think Donut would benefit greatly from returning bounding boxes, to allow further post-processing and output validation using fuzzy matching against OCR results.

leitouran avatar Sep 05 '22 18:09 leitouran

@gwkrsrch, could you give us the code that was used to generate the heatmap visualization in Figure 8 of the DONUT paper?

SamSamhuns avatar Sep 06 '22 06:09 SamSamhuns

I've found updates at #45

SamSamhuns avatar Sep 09 '22 07:09 SamSamhuns