borb
fix: table detection
Two functions, "_determine_number_of_rows_and_columns" and "_determine_table_cell_boundaries", cast the line endpoints to int in order to collect the unique coordinate values. This conversion causes the "_is_unbroken" function to fail.
I read this part of the code. Elsewhere, whether two lines intersect is usually judged by the distance between their x / y values.
The function "_determine_table_cell_boundaries" breaks this convention when it collects the unique values, and "_is_unbroken" then works with the converted int values.
This sometimes throws the error: "A Rectangle must have a non-negative height".
Example:
original xs: [110.81, 484.49, 110.81, 484.49, 110.81, 484.49, 111.05, 111.05, 207.35, 207.35, 262.85, 262.85, 315.65, 315.65, 430.25, 430.25, 484.25, 484.25]
original ys: [526.19, 526.19, 557.89, 557.89, 647.49, 647.49, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25]
sorted unique xs after the int conversion: [110, 111, 207, 262, 315, 430, 484]
sorted unique ys after the int conversion: [525, 526, 557, 647]
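The effect of the int conversion on the example coordinates can be reproduced with a small standalone sketch. It does not use borb; "truncate_unique" and "group_close" are hypothetical helpers written for illustration, and the 0.5 tolerance is an assumption, not a value taken from the library or from this commit.

```python
# Endpoint coordinates copied from the example above.
xs = [110.81, 484.49, 110.81, 484.49, 110.81, 484.49, 111.05, 111.05,
      207.35, 207.35, 262.85, 262.85, 315.65, 315.65, 430.25, 430.25,
      484.25, 484.25]
ys = [526.19, 526.19, 557.89, 557.89, 647.49, 647.49, 525.95, 647.25,
      525.95, 647.25, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25,
      525.95, 647.25]


def truncate_unique(values):
    """Unique values after casting to int (the behavior described above)."""
    return sorted({int(v) for v in values})


def group_close(values, tolerance=0.5):
    """Hypothetical alternative: group values that lie within `tolerance`
    of the previous value, so nearly coincident lines share one group."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tolerance:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups


unique_xs = truncate_unique(xs)  # [110, 111, 207, 262, 315, 430, 484]
unique_ys = truncate_unique(ys)  # [525, 526, 557, 647]

# Distance-based grouping merges 110.81 with 111.05 and 484.25 with 484.49.
x_groups = group_close(xs)
y_groups = group_close(ys)
print(unique_xs, unique_ys, len(x_groups), len(y_groups))
```

Truncation splits 110.81 and 111.05 into two separate boundaries (110 and 111), although they lie only 0.24 apart, while it merges 484.25 and 484.49 into one. A distance-based grouping treats both pairs the same way.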
In addition, in the check "min(l.y0, l.y1) <= r.get_y() and max(l.y0, l.y1) >= r.get_y() + r.get_height()" in the function "_is_unbroken", r.get_y() uses the converted value while l.y0 / l.y1 use the original values. The converted value is always <= the original value, so whenever l and r share an endpoint whose coordinate has a fractional part, the first comparison fails and the check returns false.
This commit fixes the problem described above.
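The failing comparison can be illustrated with plain numbers. This is a minimal sketch, not borb code: "spans_rectangle" is a hypothetical stand-in for the quoted condition, and Decimal is used here only so that the arithmetic is exact.

```python
from decimal import Decimal


def spans_rectangle(l_y0, l_y1, r_y, r_height):
    """Stand-in for the quoted condition from "_is_unbroken"."""
    return min(l_y0, l_y1) <= r_y and max(l_y0, l_y1) >= r_y + r_height


# A vertical line from the example, running from y=525.95 to y=647.25.
l_y0, l_y1 = Decimal("525.95"), Decimal("647.25")

# Rectangle built from the original endpoints: it shares both endpoints with the line.
check_original = spans_rectangle(l_y0, l_y1, l_y0, l_y1 - l_y0)

# Rectangle built from the int-converted endpoints (525 and 647).
r_y = Decimal(int(l_y0))
r_height = Decimal(int(l_y1)) - r_y
check_truncated = spans_rectangle(l_y0, l_y1, r_y, r_height)

print(check_original, check_truncated)
```

With the original endpoints the condition holds. With the converted ones, 525.95 <= 525 is already false, so the line is never considered to span the rectangle.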
Can you provide me with a PDF of where the previous code fails?
Sorry, I can't provide the original PDF document because it contains private information.
test.pdf I drew a table in test.pdf using reportlab. The table has 9 line segments. The problem described above can be reproduced with test.pdf. The coordinates of these line segments are the same as those listed above, but note that reportlab takes the lower left corner as the origin.
I encountered the same problem. Here is my test PDF: icbc.pdf