Bug: line_to_edge incorrectly classifies vertical lines as horizontal due to floating point precision

Open teruru331 opened this issue 10 months ago • 5 comments

Description

I found a bug in the line_to_edge function in pdfplumber.utils. It misclassifies some nearly horizontal lines as vertical because it compares floating point coordinates for exact equality.

Currently, line_to_edge determines the orientation as follows:

edge["orientation"] = "h" if (line["top"] == line["bottom"]) else "v"

However, since top and bottom are floats, a line that is horizontal in practice can end up with slightly different top and bottom values, and the exact comparison then fails.
For example, if top = 50.000 and bottom = 50.001, the check returns vertical ("v"), even though the line is effectively horizontal.
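
A minimal sketch of the comparison in question is shown below. It uses a local stand-in for the one-line check quoted above (orientation_exact is a hypothetical name, not something imported from pdfplumber) to show how a tiny difference between top and bottom flips the result of an exact equality test.

```python
# Local stand-in that mirrors the one-line check quoted above.
# It is NOT imported from pdfplumber; the name is made up for this sketch.
def orientation_exact(line: dict) -> str:
    return "h" if line["top"] == line["bottom"] else "v"


# Two illustrative lines that both span 100 units along x.
flat = {"x0": 100, "x1": 200, "top": 50.0, "bottom": 50.0}
nearly_flat = {"x0": 100, "x1": 200, "top": 50.0, "bottom": 50.001}

print(orientation_exact(flat))         # h
print(orientation_exact(nearly_flat))  # v, although the line is 100 wide and only 0.001 tall
```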

Steps to Reproduce

  1. Use pdfplumber to extract table lines from a PDF.
  2. If some nearly horizontal lines have a small difference between top and bottom, they may be classified as vertical.
  3. This can lead to incorrect table extraction results.

Example of an incorrectly classified line:

from pdfplumber.utils import line_to_edge

line = {"x0": 100, "x1": 200, "top": 50.000, "bottom": 50.001}
print(line_to_edge(line)["orientation"])  # Expected: "h", but gets "v"

Expected Behavior

The function should consider a small threshold (epsilon) when determining if a line is horizontal or vertical.

Suggested Fix

Modify line_to_edge to account for floating point precision, like this:

def line_to_edge(line: dict, epsilon: float = 1e-3) -> dict:
    edge = dict(line)
    if abs(line["top"] - line["bottom"]) < epsilon:
        edge["orientation"] = "h"
    else:
        edge["orientation"] = "v"
    return edge

This way, small floating point differences won't cause incorrect classifications.
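
As a quick check, the proposed function can be exercised on a couple of lines. The snippet below repeats the definition so that it runs on its own; the two sample lines use made-up coordinates chosen to sit well away from the threshold.

```python
def line_to_edge(line: dict, epsilon: float = 1e-3) -> dict:
    """Copy of the suggested fix, repeated so the snippet is self-contained."""
    edge = dict(line)
    if abs(line["top"] - line["bottom"]) < epsilon:
        edge["orientation"] = "h"
    else:
        edge["orientation"] = "v"
    return edge


# Illustrative lines (made-up coordinates).
nearly_flat = {"x0": 100, "x1": 200, "top": 50.0, "bottom": 50.0005}
tall = {"x0": 100, "x1": 100, "top": 50.0, "bottom": 120.0}

print(line_to_edge(nearly_flat)["orientation"])  # h
print(line_to_edge(tall)["orientation"])         # v
```

Note that the example line from this issue (bottom = 50.001) differs from top by almost exactly the default epsilon, so with a strict < its result hinges on float rounding. A somewhat larger default may be worth considering.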

Environment

  • pdfplumber version: v0.11.5
  • Python version: 3.2.2
  • OS: Windows

Additional Context

This issue affects table extraction accuracy because misclassified lines can break table structure detection.
Adding a small threshold (epsilon) would improve robustness.


---

teruru331, Feb 22 '25 07:02

Hi @teruru331, thank you for raising this issue. It does seem worth solving in some way. I just haven't gotten to it yet.

jsvine, Jun 12 '25 03:06

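Another issue needs to be considered: a line that is wider than it is tall should also count as horizontal, even when top and bottom are not exactly equal.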

def line_to_edge(line: T_obj) -> T_obj:
    edge = dict(line)
    if line["top"] == line["bottom"]:
        # Exactly equal top and bottom: horizontal. Checked first so the
        # ratio below is never computed for such lines.
        edge["orientation"] = "h"
    elif line["width"] / line["height"] > 1:
        # Wider than tall: treat as horizontal.
        edge["orientation"] = "h"
    else:
        edge["orientation"] = "v"
    return edge

zhiruiluo, Sep 16 '25 23:09

Hello Maintainers,

I am a computer science student working on an open-source contribution as part of a university assignment.

I would like to take on Issue #1276, "Bug: line_to_edge incorrectly classifies vertical lines as horizontal due to floating point precision."

The suggested fix by @teruru331 is clear and robust. I intend to proceed by:

  1. Implementing the epsilon tolerance mechanism within line_to_edge.
  2. Writing comprehensive unit test cases to verify the fix and prevent future regressions, including the example provided in the issue description.

Could you please assign this issue to me so I can finalize the fix and submit a Pull Request?

Thank you for maintaining this valuable project.

Mukeshkumar-Vadivelu, Oct 24 '25 09:10

Hi, I'm the one who originally opened this issue.

First of all, I really appreciate that you're trying to contribute to an open-source project — that’s wonderful.

Just to add a bit of context: this problem cannot be treated as a simple mistake.
For example, if a table in the PDF is rotated by 45 degrees, how should we distinguish between horizontal and vertical lines?
The PDF structure itself allows such cases.
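
To make the rotation point concrete, here is a small sketch that measures a line's angle from its endpoints. It assumes the line is given as plain endpoint coordinates (x0, y0, x1, y1), which is a simplification for illustration and not the layout of pdfplumber's line objects; line_angle_degrees is a hypothetical helper.

```python
import math


def line_angle_degrees(x0: float, y0: float, x1: float, y1: float) -> float:
    """Angle between the segment and the x-axis, folded into [0, 90] degrees."""
    dx = abs(x1 - x0)
    dy = abs(y1 - y0)
    return math.degrees(math.atan2(dy, dx))


# round() keeps the printed values free of float noise.
print(round(line_angle_degrees(0, 0, 100, 0), 6))    # 0.0  -> clearly horizontal
print(round(line_angle_degrees(0, 0, 0, 100), 6))    # 90.0 -> clearly vertical
print(round(line_angle_degrees(0, 0, 100, 100), 6))  # 45.0 -> neither label fits
```

At 45 degrees any two-way "h" or "v" label is arbitrary, which is why a tolerance on top and bottom only helps for pages that are already close to axis-aligned.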

To properly handle this, we’d need to estimate and correct the document’s rotation first, which is not trivial.
However, since many pdfplumber users process PDFs exported from Word or Excel — which are usually well-aligned —
the current line_to_edge behavior is not a critical problem in most cases.

That said, in my experience, there are rare cases where vertical and horizontal lines are misclassified,
so I think it’s worth addressing.

My suggestions are:

  1. Fix it by introducing an epsilon-based tolerance.
  2. Add a comment referencing this issue for future contributors who might have similar concerns.

Thanks again for your contribution and enthusiasm!

teruru331, Oct 24 '25 11:10

Hi teruru331,

Thank you for the detailed context. I completely agree with your views. I implemented the fix as suggested, introducing an epsilon tolerance (defaulting to 1e-3) when determining orientation.

The line_to_edge function now accepts an optional epsilon argument. If abs(top - bottom) is less than epsilon, the line is classified as horizontal ('h').

I added a comprehensive test case in test_utils.py that specifically checks the boundary conditions (just below, just above and exactly at the epsilon threshold) to ensure robust logic.
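
For readers following along, boundary checks of this kind could look roughly like the sketch below. It is not the test from the pull request: the local line_to_edge copy follows the fix suggested earlier in this thread, the test name is hypothetical, and top is fixed at 0.0 so that the differences are exact floats.

```python
def line_to_edge(line: dict, epsilon: float = 1e-3) -> dict:
    """Local copy of the epsilon-based fix suggested in this thread."""
    edge = dict(line)
    if abs(line["top"] - line["bottom"]) < epsilon:
        edge["orientation"] = "h"
    else:
        edge["orientation"] = "v"
    return edge


def test_line_to_edge_epsilon_boundaries() -> None:
    eps = 1e-3

    def orientation(height: float) -> str:
        # top = 0.0 keeps abs(top - bottom) exactly equal to `height`.
        line = {"x0": 0.0, "x1": 100.0, "top": 0.0, "bottom": height}
        return line_to_edge(line, epsilon=eps)["orientation"]

    assert orientation(eps / 2) == "h"  # just below the threshold
    assert orientation(eps) == "v"      # exactly at it: strict < excludes it
    assert orientation(eps * 2) == "v"  # just above the threshold


test_line_to_edge_epsilon_boundaries()
```

Whether a difference exactly equal to epsilon should count as horizontal is a design choice; the strict comparison here simply follows the suggested fix.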

The documentation for the updated function now includes a note about the skew vs. precision trade-off, referencing the context you provided. This seems to me a safe solution for the reported issue. Let me know if you have any thoughts on the test case or documentation.

Thank you!

Mukeshkumar-Vadivelu, Oct 25 '25 01:10