pdfplumber
Table extraction bug when lines are just barely end-to-end
Describe the bug
Via https://github.com/jsvine/pdfplumber/discussions/1087#discussioncomment-8564694, it seems that there's a bug in how pdfplumber joins lines.
Have you tried repairing the PDF?
Yes.
Code to reproduce the problem
Download the PDF in the linked comment. Then:
import pdfplumber
pdf = pdfplumber.open("2022.Sustainability.Report_NYSE_WM_2022.pdf")
page = pdf.pages[41]
im = page.to_image()
im.reset().debug_tablefinder({
    "join_x_tolerance": 0
})
And compare to:
(
    im.reset()
    .draw_lines(
        pdfplumber.table.merge_edges(
            pdfplumber.utils.filter_edges(page.edges, "h"),
            snap_x_tolerance=0,
            snap_y_tolerance=0,
            join_x_tolerance=-1,
            join_y_tolerance=0,
        )
    )
)
PDF file
See linked issue.
Expected behavior
pdfplumber's table-finding approach should merge all the sub-lines in each visual line into a single line.
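To make the expectation concrete, here is a minimal, library-independent sketch of the joining behavior I have in mind. It is not pdfplumber's implementation: `join_horizontal_segments` and its `tolerance` parameter are hypothetical names used only for illustration, and the coordinates are made up rather than taken from the PDF.

```python
def join_horizontal_segments(segments, tolerance=0.0):
    """Join (x0, x1) intervals that lie on the same horizontal line.

    Hypothetical helper, not pdfplumber's code: two intervals are joined
    when the gap between them is at most `tolerance`.
    """
    joined = []
    for x0, x1 in sorted(segments):
        if joined and x0 <= joined[-1][1] + tolerance:
            # Touching or overlapping: extend the last joined interval.
            last_x0, last_x1 = joined[-1]
            joined[-1] = (last_x0, max(last_x1, x1))
        else:
            # Gap is larger than the tolerance: start a new interval.
            joined.append((x0, x1))
    return joined


# Three sub-lines laid exactly end-to-end (made-up coordinates).
sub_lines = [(100.0, 150.0), (50.0, 100.0), (150.0, 200.0)]
print(join_horizontal_segments(sub_lines))  # [(50.0, 200.0)]
```

In this sketch, sub-lines that meet exactly end-to-end collapse into one interval even with a tolerance of 0, which is the result I would expect for each visual line on the page.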
Actual behavior
The method appears to do something strange with the lines, "finding" only certain portions of them.
Screenshots
See above
Environment
- pdfplumber version: 0.11.0
- Python version: 3.10.4
- OS: Mac
oh same here
https://github.com/jsvine/pdfplumber/issues/1296