Exotic sheet format impact over table/cell recognition?

Open mpsbrazil opened this issue 2 years ago • 1 comments

Hi, Xavier. Thank you for your library. I'm using it to scrape data from public documents, with good results. Although the single-digit issue from the OCR engine continues to bother us, I'm facing another challenge: recognizing a cell even when the OCR doesn't find any text in it.

Take a look at these pictures. First, the original PDF page.

Note that the table on this page is 6 columns by 2 rows, and that was the extracted table in the Excel output (a perfect match, even though TesseractOCR didn't recognize the one-digit numbers inside cells B3, B4, E3, E4, F3 and F4).

Now the second page of the PDF. The table keeps its 6 columns, and the number of rows may vary depending on the row height.

Now, note that the resulting sheet in the Excel file doesn't match the table's 6 columns, and that's the problem.

Could you please confirm whether this is a bug?

Dec 28 '23 19:12 mpsbrazil

Hello, this is not really supposed to happen. Can you run the extraction without any OCR and check the number of columns in your table (using the extract_tables method)? If it is simpler for you, you can just send me the document.
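For reference, a minimal sketch of that check. It assumes the document is loaded through img2table's PDF class and that each extracted table exposes a content mapping of row index to list of cells. The helper names extract_without_ocr and column_counts, and the file name in the usage note, are made up for illustration.

```python
def extract_without_ocr(pdf_path):
    """Run table detection only; with ocr=None no text is read from the cells."""
    # Third-party import kept local so the helper below works without img2table.
    from img2table.document import PDF

    pdf = PDF(src=pdf_path)
    # Assumed return value: {page_index: [ExtractedTable, ...]}
    return pdf.extract_tables(ocr=None)


def column_counts(tables_by_page):
    """Return {page_index: [number of columns of each table on that page]}."""
    counts = {}
    for page, tables in tables_by_page.items():
        counts[page] = [
            # The widest row gives the column count of the table.
            max((len(cells) for cells in table.content.values()), default=0)
            for table in tables
        ]
    return counts
```

Calling column_counts(extract_without_ocr("document.pdf")) should then report 6 columns on both pages if the cell detection itself is fine, which would point to the OCR/population step rather than to the table recognition.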

What I suspect is that, since the OCR detects no content for those columns, they are dropped when the table content is populated.
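To make the suspected mechanism concrete, here is a toy sketch in plain Python. It is not the library's actual code, and the cell values are invented: it only illustrates how removing columns with no recognized text would keep 6 columns on a page where a header row was read, yet shrink the table on a page where some columns came back completely empty.

```python
def drop_empty_columns(rows):
    """Remove every column whose cells are all None or blank (hypothetical step)."""
    if not rows:
        return []
    n_cols = len(rows[0])
    keep = [
        col for col in range(n_cols)
        if any((row[col] or "").strip() for row in rows)
    ]
    return [[row[col] for col in keep] for row in rows]


# Invented page 1: the header row has text in every column, so nothing is dropped
# even though the single digits in columns B, E and F were missed by the OCR.
page_1 = [
    ["Code", "Qty", "Description", "Unit", "Lot", "Rev"],
    ["A-01", None, "Steel pipe", "m", None, None],
]

# Invented page 2: no header row, so columns B, E and F are empty everywhere.
page_2 = [
    ["A-02", None, "Copper wire", "m", None, None],
    ["A-03", None, "PVC elbow", "pc", None, None],
]

print(len(drop_empty_columns(page_1)[0]))  # column count after the step, page 1
print(len(drop_empty_columns(page_2)[0]))  # column count after the step, page 2
```

If something like this is what happens, the fix would be to keep the detected column structure regardless of whether any text was recognized in it.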

Jan 07 '24 22:01 xavctn