
fix:table detection

Open hdoer opened this issue 1 year ago • 4 comments

Two functions, "_determine_number_of_rows_and_columns" and "_determine_table_cell_boundaries", cast the line endpoints to int in order to collect the unique coordinate values. This conversion causes the "_is_unbroken" function to fail.

I read this part of the code. Usually, whether two lines intersect is judged by the distance between their x / y values.

But the function "_determine_table_cell_boundaries" breaks this convention by truncating the values to int to get the unique values, and the function "_is_unbroken" then uses those converted int values.

This sometimes throws the error: "A Rectangle must have a non-negative height".

Example

Original:
xs: [110.81, 484.49, 110.81, 484.49, 110.81, 484.49, 111.05, 111.05, 207.35, 207.35, 262.85, 262.85, 315.65, 315.65, 430.25, 430.25, 484.25, 484.25]
ys: [526.19, 526.19, 557.89, 557.89, 647.49, 647.49, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25]

Sorted unique (after the int conversion):
xs: [110, 111, 207, 262, 315, 430, 484]
ys: [525, 526, 557, 647]
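To make the effect of the int conversion concrete, here is a minimal standalone sketch using the coordinates from the example. It does not use borb's code: the helper names `truncate_unique` and `cluster_unique` and the tolerance of 1.0 are assumptions made only for illustration. The first helper reproduces the truncation, and the second shows one possible distance-based alternative that merges nearly equal coordinates.

```python
# Coordinates copied from the example in this issue.
xs = [110.81, 484.49, 110.81, 484.49, 110.81, 484.49, 111.05, 111.05,
      207.35, 207.35, 262.85, 262.85, 315.65, 315.65, 430.25, 430.25,
      484.25, 484.25]
ys = [526.19, 526.19, 557.89, 557.89, 647.49, 647.49, 525.95, 647.25,
      525.95, 647.25, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25,
      525.95, 647.25]


def truncate_unique(values):
    """Hypothetical helper: unique values after casting each one to int."""
    return sorted({int(v) for v in values})


def cluster_unique(values, tolerance=1.0):
    """Hypothetical helper: keep one representative per group of close values.

    A value starts a new group only if it is more than `tolerance` away
    from the representative (smallest member) of the current group.
    """
    representatives = []
    for v in sorted(values):
        if representatives and v - representatives[-1] <= tolerance:
            continue  # close enough to the current group, skip it
        representatives.append(v)
    return representatives


# Truncation splits 110.81 and 111.05 into two separate columns (110, 111),
# and 525.95 and 526.19 into two separate rows (525, 526).
print(truncate_unique(xs))  # [110, 111, 207, 262, 315, 430, 484]
print(truncate_unique(ys))  # [525, 526, 557, 647]

# Distance-based grouping keeps the original float values and merges them.
print(cluster_unique(xs))   # [110.81, 207.35, 262.85, 315.65, 430.25, 484.25]
print(cluster_unique(ys))   # [525.95, 557.89, 647.25]
```

With truncation, lines that are 0.24 apart end up as two different grid boundaries, which is where the spurious extra row and column come from.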

In addition, consider the check "min(l.y0, l.y1) <= r.get_y() and max(l.y0, l.y1) >= r.get_y() + r.get_height()" in the function "_is_unbroken": r.get_y() uses the converted value, but l.y0 / l.y1 use the original values. The converted value is always <= the original value, so this check returns false whenever l and r share an endpoint with a non-integer coordinate.
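The failing comparison can be checked in isolation. The sketch below is not borb's implementation: `spans_rectangle` is a hypothetical function that mirrors the quoted expression on plain numbers, and the `tolerance` parameter is an assumption added to show how a distance-based comparison would behave.

```python
def spans_rectangle(l_y0, l_y1, r_y, r_height, tolerance=0.0):
    """Hypothetical stand-in for the quoted check, on plain numbers.

    Returns True when the vertical extent of the line covers the
    vertical extent of the rectangle, allowing `tolerance` of slack.
    """
    low, high = min(l_y0, l_y1), max(l_y0, l_y1)
    return low - tolerance <= r_y and high + tolerance >= r_y + r_height


# A vertical line from the example, with its original float endpoints.
line_y0, line_y1 = 525.95, 647.25

# A rectangle built from the truncated boundaries 525 and 647.
rect_y, rect_height = 525, 647 - 525

# Exact comparison: 525.95 <= 525 is False, so the line looks "broken".
print(spans_rectangle(line_y0, line_y1, rect_y, rect_height))       # False

# With one unit of slack the same line is accepted.
print(spans_rectangle(line_y0, line_y1, rect_y, rect_height, 1.0))  # True
```

The exact comparison fails only because the rectangle edge was rounded down while the line endpoint was not; comparing both in the same (original) coordinates or allowing a small tolerance avoids the mismatch.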

This commit fixes the problems described above.

hdoer, Aug 23 '23 08:08

Can you provide me with a PDF on which the previous code fails?

jorisschellekens, Aug 23 '23 08:08

Sorry, I can't provide the original PDF document because it contains private information.

hdoer, Aug 23 '23 09:08

test.pdf I drew a table in test.pdf using reportlab. The table has 9 line segments, and the problem described above can be reproduced with this file. The coordinates of the line segments are the same as those listed above, except that reportlab takes the lower-left corner as the origin.

hdoer, Aug 23 '23 13:08

I encountered the same problem. Here is my test PDF: icbc.pdf

Anwen954, Nov 12 '23 02:11