
Annotation Tool

Open abhayhk2001 opened this issue 2 years ago • 2 comments

Hi, we are trying to fine-tune this model on a custom set of images. We were able to generate the XML files using LabelImg, but the words.json file is a little tricky. Could you please share the annotation tool you used, or suggest an alternative?

abhayhk2001 Dec 23 '22 11:12

For some context, the format and field names of the words JSON files originate with the text extraction in PyMuPDF, which for each word gives block_num, line_num, and span_num.

The current version of the Table Transformer code for incorporating text into the table extraction needs 'span_num' to give the numerical order in which words are placed when assembling the text of each cell. 'line_num' and 'block_num' can both be set to 0 for all words, as long as 'span_num' gives the reading order.
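To make that concrete, here is a minimal sketch of building such a word list from OCR output, with 'span_num' carrying the reading order and the other two fields set to 0. The input shape (a list of `(text, [xmin, ymin, xmax, ymax])` pairs), the helper name `build_words`, and the row-bucketing sort are all assumptions made for illustration; they are not taken from the Table Transformer code.

```python
import json


def build_words(ocr_words, line_tol=10):
    """Turn (text, [xmin, ymin, xmax, ymax]) pairs into word dicts.

    Hypothetical helper: the input shape and the sorting heuristic are
    assumptions, not part of the Table Transformer code.
    """
    # Crude reading order: group words into horizontal bands of height
    # `line_tol` by their top edge, then sort left to right within a band.
    ordered = sorted(
        ocr_words,
        key=lambda w: (int(w[1][1] // line_tol), w[1][0]),
    )
    return [
        {
            "text": text,
            "bbox": list(bbox),
            "span_num": i,   # position in reading order
            "line_num": 0,   # may be 0 for every word
            "block_num": 0,  # may be 0 for every word
        }
        for i, (text, bbox) in enumerate(ordered)
    ]


sample = [
    ("Total", [50, 10, 90, 20]),
    ("Item", [10, 12, 40, 22]),
    ("Apple", [10, 40, 45, 50]),
]
words = build_words(sample)
words_json = json.dumps(words, indent=2)
```

The banding heuristic will misorder words that straddle a band boundary or sit on skewed scans, so if your OCR engine already returns words in reading order, it is safer to skip the sort and number them as given.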

Going forward, I believe we should refactor the code to ignore these fields altogether and assume the list is already sorted in reading order. This would simplify things, because the only fields needed for each word would then be 'bbox' for the bounding box and 'text' for the text content.

bsmock Mar 10 '23 22:03

Check the newly created scripts/ folder for code that creates the words JSON files from PDFs, for datasets where PDFs are available, such as PubTables-1M, FinTabNet, and SciTSR.

bsmock Mar 10 '23 22:03