
[ TextEdges ] allow single non-empty char textline

Open pushkarnimkar opened this issue 5 years ago • 1 comments

In the `TextEdges.generate` function, this change updates the textline selection criterion to require at least one non-whitespace character. Currently the effective minimum is two, because the condition checks for a length greater than one.

This was causing a problem when trying to extract tables from this file.

In the attached file, the cells after Manipur (16th row) on the first page mostly contain single-digit values, which were not updating the `TextEdges` object. Hence the table bbox generated later only covered values up to Manipur. Allowing cells with a single non-whitespace character to pass fixed this.
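To illustrate the difference, here is a minimal sketch of the selection criterion in isolation. It does not use camelot itself: `select_textlines` and the sample cell values are hypothetical, and the length check on the stripped text is an assumption about how the condition behaves, based on the description above.

```python
def select_textlines(texts, min_chars):
    """Keep texts whose whitespace-stripped length is at least min_chars.

    Hypothetical stand-in for the filter described in this issue; the real
    code operates on textline objects, not plain strings.
    """
    return [t for t in texts if len(t.strip()) >= min_chars]


# Made-up cell contents: a name, a two-digit value, two single-digit
# values (one padded with spaces), and a whitespace-only cell.
cells = ["Manipur", "12", "7", " 3 ", "  "]

# Old behaviour: stripped length must be greater than one.
old_selection = select_textlines(cells, min_chars=2)

# Proposed behaviour: a single non-whitespace character is enough.
new_selection = select_textlines(cells, min_chars=1)

print(old_selection)  # single-digit cells are dropped
print(new_selection)  # single-digit cells are kept, blank cell still dropped
```

With the old threshold the single-digit cells never reach the edge-update step, which matches the symptom of the bbox stopping at Manipur.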

Besides, after this change, one extra test case passed on my system (not sure how).

pushkarnimkar avatar Apr 11 '20 10:04 pushkarnimkar

Hey!

As camelot is dead, we are trying to build a maintained fork at pypdf_table_extraction.

Do you want to open the PR against that branch so that we can merge your improvement?

MartinThoma avatar Feb 25 '24 11:02 MartinThoma