
some labels are missing

Open alireza-hariri opened this issue 1 year ago • 1 comments

image

I just noticed that some words in the cover image are missing.

I couldn't find the code that generates this dataset from the original documents, so I can't suggest a fix.

Note: The second error in the image is the word "second", which was split by a dash. That error makes sense, but I can't explain the first one.

alireza-hariri avatar Nov 07 '24 05:11 alireza-hariri

After more inspection, I found some other problems, this time with box sizes:

  1. There are a lot of boxes with zero width or height (even when the label is "paragraph" and the token doesn't include "Line##").
  2. There are a lot of boxes labeled "paragraph" that are too tall (see the image).

image
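For anyone who wants to reproduce the two box-size problems, here is a rough sketch of the check. It assumes each annotation line is tab-separated as token, x0, y0, x1, y1, R, G, B, font name, label, with coordinates on a 0 to 1000 scale. The names `Box`, `parse_line`, `find_suspicious`, and the `max_height` threshold are my own and purely illustrative, not part of the dataset tooling.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Box(NamedTuple):
    token: str
    x0: int
    y0: int
    x1: int
    y1: int
    label: str


def parse_line(line: str) -> Box:
    # Assumed layout: token, x0, y0, x1, y1, R, G, B, font name, label
    parts = line.rstrip("\n").split("\t")
    x0, y0, x1, y1 = (int(v) for v in parts[1:5])
    return Box(parts[0], x0, y0, x1, y1, parts[-1])


def find_suspicious(
    boxes: Iterable[Box], max_height: int = 100
) -> Tuple[List[Box], List[Box]]:
    """Return (degenerate, too_tall) boxes.

    degenerate: zero (or negative) width or height.
    too_tall:   "paragraph" boxes taller than max_height, a hypothetical
                threshold to be tuned by eye.
    """
    boxes = list(boxes)
    degenerate = [b for b in boxes if b.x1 <= b.x0 or b.y1 <= b.y0]
    too_tall = [
        b for b in boxes
        if b.label == "paragraph" and (b.y1 - b.y0) > max_height
    ]
    return degenerate, too_tall
```

To run it on one annotation file, something like `find_suspicious(parse_line(l) for l in open(path, encoding="utf-8"))` should work, provided the file really follows the assumed layout.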

alireza-hariri avatar Nov 07 '24 09:11 alireza-hariri

@alireza-hariri Same here! How did you solve it? Also, the id ↔ name mappings are inconsistent across the train, validation and test sets.
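A minimal sketch of how the mapping mismatch can be checked, assuming the id → name mapping of each split has already been loaded into a plain dict (how you load it depends on the annotation format you use). The function name `find_mapping_conflicts` and the example values below are hypothetical.

```python
from typing import Dict


def find_mapping_conflicts(
    mappings: Dict[str, Dict[int, str]]
) -> Dict[int, Dict[str, str]]:
    """Return every id whose name differs (or is missing) in some split.

    mappings: split name -> {category id -> category name}
    """
    all_ids = set()
    for id_to_name in mappings.values():
        all_ids.update(id_to_name)

    conflicts: Dict[int, Dict[str, str]] = {}
    for cat_id in sorted(all_ids):
        # Collect the name each split assigns to this id.
        names = {
            split: id_to_name.get(cat_id, "<missing>")
            for split, id_to_name in mappings.items()
        }
        if len(set(names.values())) > 1:
            conflicts[cat_id] = names
    return conflicts


# Made-up example values, only to show the shape of the input.
example = {
    "train": {0: "paragraph", 1: "title"},
    "valid": {0: "paragraph", 1: "title"},
    "test": {0: "title", 1: "paragraph"},
}
```

Calling `find_mapping_conflicts(example)` reports ids 0 and 1, because the made-up "test" split swaps the two names.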

dimitri009 avatar May 02 '25 12:05 dimitri009