donut
Bounding boxes required for pretraining?
Does the pre-training of Donut require bounding boxes for individual words? The synthetically generated SynthDoG dataset (https://huggingface.co/datasets/naver-clova-ix/synthdog-en), which was also used for Donut pre-training, contains no bounding boxes, so I assume the visual corpus described in the paper also lacks bounding-box coordinates.
I'm not one of the authors, but as far as I understand, Donut is only pre-trained on the generated OCR text (a "read the text in order" objective), not on hOCR output, which would include bounding boxes. Models like UDOP, LiLT, or LayoutLM come to mind, which do pretty much what you describe during pre-training, and they get good results with that approach.
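To make the distinction concrete, here is a minimal sketch (the data and helper names are illustrative, not the authors' actual pipeline) contrasting a Donut-style text-only pretraining target with a LayoutLM-style target that keeps a bounding box per word:

```python
# Toy word-level OCR annotations: (word, box) pairs, where box is
# (x0, y0, x1, y1). Layout-aware models consume both fields; Donut's
# pretraining target discards the boxes entirely.
words_with_boxes = [
    ("Invoice", (10, 10, 90, 30)),
    ("No.",     (95, 10, 130, 30)),
    ("42",      (135, 10, 160, 30)),
]

# Donut-style target: just the words in reading order, no coordinates.
donut_target = " ".join(word for word, _ in words_with_boxes)

# LayoutLM-style input: each token paired with its bounding box.
layout_input = [(word, box) for word, box in words_with_boxes]

print(donut_target)   # plain text sequence
print(layout_input)   # token + box pairs
```

So a corpus like SynthDoG, which only records the rendered text, is sufficient for the Donut-style objective but not for the box-conditioned ones.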