DocBank
some labels are missing
I just noticed that some words in the cover image are missing.
I couldn't find any code for generating this dataset from the original docs to suggest an edit.
Note: The second error in the image is the word "second", which was split with a dash. That error makes sense, but I couldn't find a reason for the first one.
After more inspection, I found some other problems, this time with box sizes:
- There are a lot of boxes with zero width or height (even when the label is "paragraph" and the token doesn't include "Line##").
- There are a lot of boxes with the "paragraph" label that are too tall (see the image).
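
For anyone who wants to reproduce the two box-size checks above, here is a rough sketch. It assumes each annotation line is tab-separated with the token first, then `x0`, `y0`, `x1`, `y1`, and the label in the last column; that layout is my assumption, so adjust `parse_line` if your files differ. The names `Box`, `parse_line`, `find_suspicious`, and the `max_height_ratio` threshold are made up for this example and are not part of DocBank.

```python
from statistics import median
from typing import Iterable, List, NamedTuple, Tuple


class Box(NamedTuple):
    token: str
    x0: int
    y0: int
    x1: int
    y1: int
    label: str


def parse_line(line: str) -> Box:
    # Assumed layout: token, x0, y0, x1, y1, <other columns>, label (tab-separated).
    parts = line.rstrip("\n").split("\t")
    return Box(parts[0], int(parts[1]), int(parts[2]),
               int(parts[3]), int(parts[4]), parts[-1])


def find_suspicious(boxes: Iterable[Box],
                    max_height_ratio: float = 3.0) -> Tuple[List[Box], List[Box]]:
    """Return (degenerate, too_tall) boxes for one page.

    degenerate: zero (or negative) width or height.
    too_tall:   height greater than max_height_ratio times the median
                height of the non-degenerate boxes (an arbitrary heuristic).
    """
    boxes = list(boxes)
    degenerate = [b for b in boxes if b.x1 <= b.x0 or b.y1 <= b.y0]
    heights = [b.y1 - b.y0 for b in boxes if b.x1 > b.x0 and b.y1 > b.y0]
    if not heights:
        return degenerate, []
    typical = median(heights)
    too_tall = [b for b in boxes
                if b.y1 - b.y0 > max_height_ratio * typical]
    return degenerate, too_tall
```

Running `find_suspicious(parse_line(l) for l in open(path, encoding="utf-8"))` on a single page file should list the offending tokens; the median-based threshold is only a heuristic and may need tuning per document.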
@alireza-hariri Same here! How did you solve it? Also, the id ↔ name mappings are inconsistent across the train, validation, and test sets.
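
A quick way to pin down the mapping mismatch is to compare the id → name dictionaries of the three splits directly. This is only a sketch: it assumes you have already loaded each split's mapping into a plain `dict` (how you load it depends on the files you have), and `inconsistent_ids` is a hypothetical helper name.

```python
from typing import Dict


def inconsistent_ids(mappings: Dict[str, Dict[int, str]]) -> Dict[int, Dict[str, str]]:
    """Return every id whose name differs (or is missing) in some split.

    `mappings` maps a split name (e.g. "train") to that split's id -> name dict.
    """
    all_ids = set()
    for id_to_name in mappings.values():
        all_ids.update(id_to_name)

    problems = {}
    for category_id in sorted(all_ids):
        names = {split: id_to_name.get(category_id, "<missing>")
                 for split, id_to_name in mappings.items()}
        # More than one distinct value means the splits disagree on this id.
        if len(set(names.values())) > 1:
            problems[category_id] = names
    return problems
```

An empty result means the splits agree; otherwise each entry shows which name every split assigns to the conflicting id.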