Table extraction bug when lines are just barely end-to-end

Open • jsvine opened this issue 1 year ago • 1 comment

Describe the bug

Based on https://github.com/jsvine/pdfplumber/discussions/1087#discussioncomment-8564694, there appears to be a bug in how pdfplumber joins lines.

Have you tried repairing the PDF?

Yes.

Code to reproduce the problem

Download the PDF in the linked comment. Then:

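import pdfplumber

pdf = pdfplumber.open("2022.Sustainability.Report_NYSE_WM_2022.pdf")
page = pdf.pages[41]  # zero-based index, i.e. the 42nd page
im = page.to_image()

# Draw what the table finder detects with "join_x_tolerance" set to 0
im.reset().debug_tablefinder({
    "join_x_tolerance": 0
})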

image

And compare to:

(
    im.reset()
    .draw_lines(
        pdfplumber.table.merge_edges(
            pdfplumber.utils.filter_edges(page.edges, "h"),
            snap_x_tolerance=0,
            snap_y_tolerance=0,
            join_x_tolerance=-1,
            join_y_tolerance=0,
        )
    )
)

image

PDF file

See linked issue.

Expected behavior

pdfplumber's table-finding approach should merge all the sub-lines in each visual line into a single line.
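
To make the expected merging concrete, here is a minimal, self-contained sketch of joining collinear sub-lines. It is not pdfplumber's implementation: the `Segment` tuple and the `join_segments` helper are hypothetical names invented for illustration, and the coordinates are made up. It also shows how a gap of a fraction of a point between "barely end-to-end" segments can defeat a zero tolerance, which is one possible way such a failure could arise, not a confirmed diagnosis of this issue.

```python
from typing import List, Tuple

# Hypothetical representation: (x0, x1) of one horizontal sub-line.
Segment = Tuple[float, float]


def join_segments(segments: List[Segment], tolerance: float = 0.0) -> List[Segment]:
    """Merge collinear segments whose horizontal gap is at most `tolerance`."""
    joined: List[Segment] = []
    for x0, x1 in sorted(segments):
        if joined and x0 <= joined[-1][1] + tolerance:
            # Overlaps or touches the previous segment: extend it if needed.
            prev_x0, prev_x1 = joined[-1]
            joined[-1] = (prev_x0, max(prev_x1, x1))
        else:
            # Too far from the previous segment: start a new line.
            joined.append((x0, x1))
    return joined


# Sub-lines that meet exactly end-to-end collapse into one visual line.
touching = [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0)]
print(join_segments(touching))  # [(0.0, 30.0)]

# A tiny gap keeps them apart at zero tolerance, but not at a small one.
barely_apart = [(0.0, 10.0), (10.001, 20.0)]
print(join_segments(barely_apart, tolerance=0.0))   # [(0.0, 10.0), (10.001, 20.0)]
print(join_segments(barely_apart, tolerance=0.01))  # [(0.0, 20.0)]
```

In this sketch, the expected behavior described above corresponds to the first call: every sub-line of a visual line ends up in a single merged segment.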

Actual behavior

Instead of merging them, the table finder appears to do something strange with the lines, "finding" only certain portions of them.

Screenshots

See above

Environment

  • pdfplumber version: 0.11.0
  • Python version: 3.10.4
  • OS: Mac

jsvine • Mar 11 '24 21:03

Oh, same here:

https://github.com/jsvine/pdfplumber/issues/1296

kalle07 • May 17 '25 14:05