How to get the bounding boxes of the extracted entities?
It would be great if donut had the ability to extract the bounding box of each extracted entity. Bounding box information is important and useful for visualization and downstream tasks.
As far as I know, there aren't actually any bounding boxes. The image is encoded into features, not into actual boxes.
If you need boxes, you are better off using traditional OCR + modeling (LayoutLMv2/v3 are great options for this approach).
Hi, thanks to @logan-markewich for the helpful comment :)
donut does not require any bounding-box annotation/supervision during model training, but, as a result, there are no actual boxes in the model output. Instead, you can get an attention heatmap that could be used for your purpose; see also Figure 8 of https://arxiv.org/abs/2111.15664. The related code line is at:
- https://github.com/clovaai/donut/blob/1.0.5/donut/model.py#L492
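If it helps, here is a minimal sketch of how the same cross-attentions can be retrieved through the Hugging Face port of Donut; the checkpoint name, image path, and task prompt below are just examples and should be replaced with your own:

```python
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Example checkpoint; substitute your own fine-tuned Donut model.
ckpt = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(ckpt)
model = VisionEncoderDecoderModel.from_pretrained(ckpt)

image = Image.open("document.png").convert("RGB")  # any document image
pixel_values = processor(image, return_tensors="pt").pixel_values
task_prompt = "<s_cord-v2>"  # task start token for this example checkpoint
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

decoder_output = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    return_dict_in_generate=True,
    output_attentions=True,  # makes generate() return the cross-attentions
)
# decoder_output.cross_attentions holds, per generated token, one tensor per
# decoder layer with shape (batch, num_heads, tgt_len, encoder_seq_len).
```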
You may convert the heatmap to bounding boxes. The following link might be useful to you:
- https://stackoverflow.com/a/58421765
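For instance, a minimal sketch of the threshold-and-contours approach from that answer, assuming `heatmap` is a 2D numpy array normalized to [0, 1] and already resized to the image resolution:

```python
import cv2
import numpy as np

def heatmap_to_boxes(heatmap, threshold=0.5):
    # Binarize the heatmap and take the bounding rectangle of each blob.
    binary = (heatmap >= threshold).astype(np.uint8) * 255
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]  # (x, y, w, h) per region
```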
Hope this helps. Please let me know if you are still confused.
@WeiquanWa , did you manage to get some semblance of bounding boxes or the cross-attention heatmap from the outputs?
I cannot interpret the structure of the attention maps returned in decoder_output.cross_attentions.
I can see it is a tuple of tuples, with the outer length equal to the number of generated tokens (len(decoder_output.sequences)), but each entry contains 4 tensors of shape torch.Size([1, 16, 1, 1200]). I'm not sure how to get representative heatmaps from these tensors.
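The best interpretation I've come up with so far (unverified) is that the 4 entries are one tensor per decoder layer, and that the 1200 positions correspond to a 40x30 encoder feature grid (a 1280x960 input downsampled 32x by the Swin encoder, since 40 * 30 == 1200). Under those assumptions, something like this should give one 2D map per generated token:

```python
import torch

def token_heatmaps(cross_attentions, grid_hw=(40, 30)):
    # cross_attentions: one tuple per generated token; each inner tuple holds
    # one tensor per decoder layer, shape (batch, heads, tgt_len, enc_len).
    # grid_hw is an assumption: 1280x960 input / 32x Swin downsampling = 40x30.
    maps = []
    for per_token in cross_attentions:
        layers = torch.stack(per_token)    # (layers, batch, heads, tgt_len, enc_len)
        avg = layers.mean(dim=(0, 2))      # average over layers and heads
        maps.append(avg[0, -1].reshape(grid_hw))  # last query position -> (40, 30)
    return maps
```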
I second this question. I assume these attention maps have to be mapped back to the pre-encoding stage to match the actual image shape, but I can't seem to figure out how to do this.
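The closest I've gotten is a plain bilinear resize of each per-token map back up to the model's input resolution (untested against the paper's visualization):

```python
import cv2

def upsample_heatmap(attn_map, input_w, input_h):
    # attn_map: one 2D per-token map on the encoder grid, e.g. shape (40, 30).
    hm = attn_map.detach().cpu().numpy()
    hm = (hm - hm.min()) / (hm.max() - hm.min() + 1e-8)  # normalize to [0, 1]
    # cv2.resize takes (width, height); upsample back to the model input size.
    return cv2.resize(hm, (input_w, input_h), interpolation=cv2.INTER_LINEAR)
```

Note this aligns with the padded/resized model input, not the original image, so Donut's preprocessing (resize and pad) would still need to be undone to overlay the map on the original.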
As suggested in #31, I think Donut would benefit greatly from returning bounding boxes to allow further post-processing and output validation using fuzzy matching against OCR results.
@gwkrsrch, could you give us the code that was used to generate the heatmap visualization in Figure 8 of the DONUT paper?
I've found updates at #45.