Bug: line_to_edge incorrectly classifies vertical lines as horizontal due to floating point precision
Description
I found a bug in the line_to_edge function in pdfplumber.utils. It can assign the wrong orientation to a line because it compares floating-point coordinates for exact equality.
Currently, line_to_edge determines the orientation as follows:
edge["orientation"] = "h" if (line["top"] == line["bottom"]) else "v"
However, since top and bottom are floats, small floating-point differences can cause incorrect classifications.
For example, if top = 50.000 and bottom = 50.001, the two values are not equal, so the line is classified as vertical ("v") even though it is effectively horizontal.
Steps to Reproduce
- Use pdfplumber to extract table lines from a PDF.
- If a horizontal line has a small difference between top and bottom, it may be classified as vertical.
- This can lead to incorrect table extraction results.
Example of an incorrectly classified line:
line = {"x0": 100, "x1": 200, "top": 50.000, "bottom": 50.001}
print(line_to_edge(line)["orientation"])  # Expected: "h", but gets "v"
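The comparison can be checked without a PDF. The sketch below uses a local copy of the one-line check quoted above (the name `current_line_to_edge` is hypothetical) instead of importing pdfplumber, so it only illustrates the exact-equality test itself: any nonzero difference between top and bottom falls through to "v".

```python
def current_line_to_edge(line: dict) -> dict:
    """Local copy of the exact-equality check quoted in this issue."""
    edge = dict(line)
    edge["orientation"] = "h" if (line["top"] == line["bottom"]) else "v"
    return edge


# A line 100 units wide whose top and bottom differ by only 0.001.
line = {"x0": 100, "x1": 200, "top": 50.000, "bottom": 50.001}
width = line["x1"] - line["x0"]
height = line["bottom"] - line["top"]

orientation = current_line_to_edge(line)["orientation"]
print(f"width={width}, height={height:.4f}, orientation={orientation}")
```

The line is far wider than it is tall, yet the inequality of the two floats alone decides the result.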
Expected Behavior
The function should consider a small threshold (epsilon) when determining if a line is horizontal or vertical.
Suggested Fix
Modify line_to_edge to account for floating point precision, like this:
def line_to_edge(line: dict, epsilon: float = 1e-3) -> dict:
    edge = dict(line)
    if abs(line["top"] - line["bottom"]) < epsilon:
        edge["orientation"] = "h"
    else:
        edge["orientation"] = "v"
    return edge
This way, small floating point differences won't cause incorrect classifications.
Environment
- pdfplumber version: v0.11.5
- Python version: 3.2.2
- OS: Windows
Additional Context
This issue affects table extraction accuracy because misclassified lines can break table structure detection.
Adding a small threshold (epsilon) would improve robustness.
---
Hi @teruru331, thank you for raising this issue. It does seem worth solving in some way. I just haven't gotten to it yet.
There is another issue to consider.
def line_to_edge(line: T_obj) -> T_obj:
    edge = dict(line)
    if line["top"] == line["bottom"]:
        edge["orientation"] = "h"
    elif line["width"] / line["height"] > 1:
        edge["orientation"] = "h"
    else:
        edge["orientation"] = "v"
    # edge["orientation"] = "h" if (line["top"] == line["bottom"]) else "v"
    return edge
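A variant of the same idea is to compare the two extents directly instead of dividing one by the other. The sketch below is only an illustration: the name `orientation_by_extent` is hypothetical, and it computes width and height from the coordinates rather than assuming the `width` and `height` keys are present.

```python
def orientation_by_extent(line: dict) -> str:
    """Return "h" when the line is at least as wide as it is tall, else "v"."""
    width = abs(line["x1"] - line["x0"])
    height = abs(line["bottom"] - line["top"])
    # Comparing the extents needs no division, so a zero height is harmless.
    return "h" if width >= height else "v"


nearly_flat = {"x0": 100, "x1": 200, "top": 50.000, "bottom": 50.001}
nearly_upright = {"x0": 100.000, "x1": 100.001, "top": 50, "bottom": 150}

print(orientation_by_extent(nearly_flat))     # h
print(orientation_by_extent(nearly_upright))  # v
```

Ties (width equal to height) are assigned "h" here, which is an arbitrary choice; the ratio test above would assign them "v".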
Hello Maintainers,
I am a computer science student working on an open-source contribution as part of a university assignment.
I would like to take on Issue #1276, "Bug: line_to_edge incorrectly classifies vertical lines as horizontal due to floating point precision."
The suggested fix by @teruru331 is clear and robust. I intend to proceed by:
- Implementing the epsilon tolerance mechanism within line_to_edge.
- Writing comprehensive unit test cases to verify the fix and prevent future regressions, including the example provided in the issue description.
Could you please assign this issue to me so I can finalize the fix and submit a Pull Request?
Thank you for maintaining this valuable project.
Hi, I'm the one who originally opened this issue.
First of all, I really appreciate that you're trying to contribute to an open-source project — that’s wonderful.
Just to add a bit of context: this problem cannot be treated as a simple mistake.
For example, if a table in the PDF is rotated by 45 degrees, how should we distinguish between horizontal and vertical lines?
The PDF structure itself allows such cases.
To properly handle this, we’d need to estimate and correct the document’s rotation first, which is not trivial.
However, since many pdfplumber users process PDFs exported from Word or Excel, which are usually well-aligned, the current line_to_edge behavior is not a critical problem in most cases.
That said, in my experience, there are rare cases where vertical and horizontal lines are misclassified, so I think it's worth addressing.
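To make the rotation concern concrete, here is a rough sketch that classifies a line by its angle from the x-axis rather than by a coordinate difference. Everything in it is an assumption for illustration: the function names, the extra "skewed" label, and the 1-degree tolerance are not part of pdfplumber, and the angle is taken from the line's bounding box only.

```python
import math


def line_angle_degrees(line: dict) -> float:
    """Angle of the bounding-box diagonal from the x-axis, in [0, 90]."""
    width = abs(line["x1"] - line["x0"])
    height = abs(line["bottom"] - line["top"])
    return math.degrees(math.atan2(height, width))


def classify_by_angle(line: dict, tolerance_deg: float = 1.0) -> str:
    """Return "h", "v", or "skewed" depending on the angle from the axes."""
    angle = line_angle_degrees(line)
    if angle <= tolerance_deg:
        return "h"
    if angle >= 90.0 - tolerance_deg:
        return "v"
    return "skewed"


flat = {"x0": 100, "x1": 200, "top": 50.000, "bottom": 50.001}
diagonal = {"x0": 0, "x1": 100, "top": 0, "bottom": 100}

print(classify_by_angle(flat))      # h
print(classify_by_angle(diagonal))  # skewed
```

A line from a table rotated by 45 degrees lands in the "skewed" bucket, which shows why neither "h" nor "v" is a meaningful answer until the page rotation has been corrected.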
My suggestions are:
- Fix it by introducing an epsilon-based tolerance.
- Add a comment referencing this issue for future contributors who might have similar concerns.
Thanks again for your contribution and enthusiasm!
Hi teruru331,
Thank you for the detailed context. I completely agree with your views. I implemented the fix as suggested, introducing an epsilon tolerance (defaulting to 1e-3) when determining orientation.
The line_to_edge function now accepts an optional epsilon argument. If abs(top - bottom) is less than epsilon, the line is classified as horizontal ('h').
I added a comprehensive test case in test_utils.py that specifically checks the boundary conditions (just below, just above and exactly at the epsilon threshold) to ensure robust logic.
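The test itself is not reproduced in this thread. As a minimal sketch of what such boundary checks could look like, the block below uses a local copy of the function suggested earlier in this issue; the helper `orientation` and the test name are hypothetical. It picks a power-of-two epsilon so that the gap between top and bottom is exactly representable and the "exactly at the threshold" case is not blurred by rounding.

```python
def line_to_edge(line: dict, epsilon: float = 1e-3) -> dict:
    """Local copy of the fix suggested in this issue."""
    edge = dict(line)
    if abs(line["top"] - line["bottom"]) < epsilon:
        edge["orientation"] = "h"
    else:
        edge["orientation"] = "v"
    return edge


def orientation(gap: float, epsilon: float) -> str:
    """Classify a 100-unit-wide line whose top and bottom differ by `gap`."""
    line = {"x0": 100, "x1": 200, "top": 50.0, "bottom": 50.0 + gap}
    return line_to_edge(line, epsilon=epsilon)["orientation"]


def test_epsilon_boundaries() -> None:
    # Powers of two are exactly representable, so each gap is exact.
    epsilon = 2.0 ** -10
    assert orientation(epsilon / 2, epsilon) == "h"  # below the threshold
    assert orientation(epsilon, epsilon) == "v"      # exactly at it (strict <)
    assert orientation(epsilon * 2, epsilon) == "v"  # above the threshold


test_epsilon_boundaries()
```

With a decimal epsilon such as 1e-3, a gap written as 50.001 - 50.0 is not exactly 0.001 in binary floating point, so a test at the exact threshold may pass or fail for reasons unrelated to the logic.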
The documentation for the updated function now includes a note about the skew vs. precision trade-off, referencing the context you provided. It seems to me to be a safe solution for the reported issue. Let me know if you have any thoughts on the test case or documentation.
Thank you!