donut
Bounding boxes required for pretraining?
Does the pre-training of Donut require bounding boxes for individual words? The synthetically generated SynthDoG dataset (https://huggingface.co/datasets/naver-clova-ix/synthdog-en), which was also used for Donut pre-training, contains no bounding boxes, so I assume the visual corpus described in the paper also lacks bounding-box coordinates.
I'm not one of the authors, but as far as I understand, Donut is only pre-trained on the generated OCR text (a "read the text in order" objective), not on hOCR output, which would include bounding boxes. Models like UDOP, LiLT, or LayoutLM come to mind, which do pretty much what you describe during pre-training, and they get good results with that approach.
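To make the distinction concrete, here is a minimal sketch (the data and helper names are illustrative, not the authors' actual pipeline) contrasting a Donut-style text-only pretraining target with a LayoutLM-style target that keeps a bounding box per word:

```python
# Toy word-level OCR annotations: (word, box) pairs, where box is
# (x0, y0, x1, y1). Layout-aware models consume both fields; Donut's
# pretraining target discards the boxes entirely.
words_with_boxes = [
    ("Invoice", (10, 10, 90, 30)),
    ("No.",     (95, 10, 130, 30)),
    ("42",      (135, 10, 160, 30)),
]

# Donut-style target: just the words in reading order, no coordinates.
donut_target = " ".join(word for word, _ in words_with_boxes)

# LayoutLM-style input: each token paired with its bounding box.
layout_input = [(word, box) for word, box in words_with_boxes]

print(donut_target)   # plain text sequence
print(layout_input)   # token + box pairs
```

So a corpus like SynthDoG, which only records the rendered text, is sufficient for the Donut-style objective but not for the box-conditioned ones.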