Issue with overlapping columns and rows.

Open Prabhav55 opened this issue 1 year ago • 5 comments

Hi,

I have been using Table Transformer for a project related to extraction and I had a few questions regarding the pre and post processing of outputs:

  1. Currently I am using the DETR feature extractor to post-process the model output. The accuracy is good, but at a confidence threshold of around 60% I am observing a lot of overlap between the columns. For example, if three columns are present, the output contains five overlapping columns. Reducing the threshold decreases the output quality for other inputs. Sample images are attached below:
[Two screenshots attached: Screenshot 2023-06-28 at 10 57 28 AM, Screenshot 2023-06-28 at 10 57 34 AM]
  2. Is there a way to increase the padding for columns in post-processing?
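
To make both questions concrete, here is a minimal sketch of the kind of post-processing I have in mind. It is not part of Table Transformer or of the DETR feature extractor: the function names, the `max_overlap` value, and the `(x_min, y_min, x_max, y_max)` pixel box format are all assumptions for illustration. It keeps the highest-scoring column boxes, drops lower-scoring ones that overlap them horizontally, and then widens the survivors by a fixed padding.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def horizontal_overlap(a: Box, b: Box) -> float:
    """Fraction of the narrower box's width that the other box covers."""
    inter = min(a[2], b[2]) - max(a[0], b[0])
    if inter <= 0:
        return 0.0
    narrower = min(a[2] - a[0], b[2] - b[0])
    return inter / narrower if narrower > 0 else 0.0


def suppress_overlapping_columns(
    boxes: Sequence[Box],
    scores: Sequence[float],
    max_overlap: float = 0.5,  # hypothetical value, tune per dataset
) -> List[Box]:
    """Greedy 1-D suppression: visit boxes from highest to lowest score and
    keep a box only if it does not overlap an already kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(horizontal_overlap(boxes[i], boxes[j]) <= max_overlap for j in kept):
            kept.append(i)
    # Return the surviving columns in left-to-right order.
    return sorted((boxes[i] for i in kept), key=lambda box: box[0])


def pad_columns(boxes: Sequence[Box], pad: float, image_width: float) -> List[Box]:
    """Widen each column by `pad` pixels per side, clamped to the image."""
    return [
        (max(0.0, x0 - pad), y0, min(float(image_width), x1 + pad), y1)
        for x0, y0, x1, y1 in boxes
    ]


if __name__ == "__main__":
    # Made-up detections: three real columns plus two overlapping duplicates.
    boxes = [(0, 0, 100, 200), (90, 0, 200, 200), (40, 0, 150, 200),
             (210, 0, 300, 200), (205, 0, 295, 200)]
    scores = [0.95, 0.90, 0.65, 0.92, 0.61]
    columns = suppress_overlapping_columns(boxes, scores)
    print(pad_columns(columns, pad=5, image_width=300))
```

The overlap is measured against the narrower box so that a thin spurious column lying inside a wide one is still suppressed. Whether something like this is reasonable, or whether there is a recommended built-in option, is what I am asking.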

Thanks for the help! Happy to provide any other information necessary.

Prabhav55, Jun 28 '23 05:06

Hi,

Are you using the model trained only on PubTables-1M? I can see why that model would be confused: it hasn't seen very many tables (if any) where a dollar sign is that far to the left within the column. Have you tried training TATR with FinTabNet.c? We have a script to process the FinTabNet dataset into a dataset called FinTabNet.c that can be used to train TATR. That should help a lot. We have already trained a model jointly on PubTables-1M and FinTabNet.c but we still need to get approval to release the weights.

Cheers, Brandon

bsmock, Jun 28 '23 05:06

Hi,

Thanks for the quick help. I was looking for a way to improve performance with post-processing (due to memory constraints for training), but I think you are right about fine-tuning. Just a side question: is the DETR feature extractor the recommended post-processor for table-transformer? HuggingFace also has an AutoImageProcessor.

Thanks, Prabhav

Prabhav55, Jun 28 '23 05:06

@bsmock would I need to modify detection_config.json and structure_config.json when I train TATR with the FinTabNet dataset?

linkstatic12, Aug 07 '23 18:08

I have found that EasyOCR is much better than Tesseract for OCR on PDFs with tables and financial data. I am also trying to use TrOCR with TATR to resolve the issue I am working on. Do sites like docsumo and extracttables use TATR or CascadeTabNet? In your opinion, which is better: CascadeTabNet or TATR?

- docsumo: docsumo.com
- extracttables: https://extracttable.com/
- CascadeTabNet: https://github.com/DevashishPrasad/CascadeTabNet/tree/master

linkstatic12, Aug 07 '23 18:08

You will get an error while running process_fintabnet.py. To fix it, modify the code at line 1340.

From this: `with open(save_filepath, 'w') as out_file:`

To this: `with open(save_filepath, 'w', encoding="utf-8") as out_file:`
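
For context, here is a small standalone sketch of why the explicit encoding matters. This is not code from process_fintabnet.py: the helper names and the sample string are made up for illustration, and I am assuming the failure is an encoding error on non-ASCII characters. Without the `encoding` argument, `open` in text mode falls back to the platform's locale encoding (often cp1252 on Windows), which cannot represent every Unicode character.

```python
import os
import tempfile


def save_text(save_filepath: str, text: str) -> None:
    """Write text as UTF-8 regardless of the platform's default encoding."""
    with open(save_filepath, "w", encoding="utf-8") as out_file:
        out_file.write(text)


def load_text(filepath: str) -> str:
    """Read the file back with the same explicit encoding."""
    with open(filepath, "r", encoding="utf-8") as in_file:
        return in_file.read()


if __name__ == "__main__":
    # Hypothetical cell text with characters outside ASCII.
    sample = "€ 1,200 ≥ café"
    path = os.path.join(tempfile.mkdtemp(), "cell.txt")
    save_text(path, sample)
    print(load_text(path) == sample)
```

Reading the generated files back should use the same `encoding="utf-8"` argument, otherwise the mismatch just moves to the reading side.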

linkstatic12, Aug 08 '23 07:08