
Allow configuring a minimum length for the edges that are considered for line reconstruction

Open bronislav opened this issue 11 months ago • 5 comments

Small edges (shorter than 1) are ignored before attempting to reconstruct a line from them in https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L668-L671

The filter_edge method has a min_length parameter that currently defaults to 1. I came across a PDF file in which a line is drawn as small dashes of length 0.94. These dashes are filtered out, so the line is ignored when extracting the table.
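To illustrate the effect described above, here is a minimal, self-contained sketch. It does not use pdfplumber's code: the helper names (edge_length, filter_by_length), the dict layout, and the dash spacing of 1.5 are assumptions made up for the example. Only the dash length of 0.94 and the default threshold of 1 come from the report.

```python
# Hypothetical stand-in for a minimum-length edge filter; not pdfplumber's code.
def edge_length(edge):
    """Length of an axis-aligned edge stored as a dict of coordinates."""
    if edge["orientation"] == "h":
        return edge["x1"] - edge["x0"]
    return edge["bottom"] - edge["top"]


def filter_by_length(edges, min_length=1):
    """Keep only edges whose length is at least min_length."""
    return [e for e in edges if edge_length(e) >= min_length]


# A dotted horizontal rule: 50 dashes, each 0.94 long (spacing is assumed).
dashes = [
    {
        "orientation": "h",
        "x0": 10 + i * 1.5,
        "x1": 10 + i * 1.5 + 0.94,
        "top": 100,
        "bottom": 100,
    }
    for i in range(50)
]

kept_default = filter_by_length(dashes)       # threshold of 1 drops every dash
kept_relaxed = filter_by_length(dashes, 0.5)  # a lower threshold keeps them all
```

With the threshold at 1, nothing survives to be merged into a line, which matches the behaviour reported here.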

At first, I thought of using the edge_min_length setting for this, but that might not be backward compatible. Introducing an additional setting might be a better approach.

I am willing to submit a pull request, but first, I want to know if you are willing to accept this improvement.

bronislav avatar Feb 13 '25 03:02 bronislav

Interesting edge-case, @bronislav. Thanks for sharing and flagging. I think passing edge_min_length to the filter_edge calls should work ... but perhaps I'm overlooking something?

jsvine avatar Feb 13 '25 04:02 jsvine

It should work, but this setting filters out small edges after merging. Some cases require different thresholds for edge filtering before and after merging.
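A rough sketch of why the two thresholds can differ, under stated assumptions: segments are reduced to one-dimensional (start, end) pairs on a single row, and the names reconstruct, prefilter_min, postfilter_min and join_tolerance are hypothetical, chosen for the example rather than taken from pdfplumber.

```python
def length(seg):
    """Length of a (start, end) segment."""
    return seg[1] - seg[0]


def merge_segments(segments, join_tolerance):
    """Join collinear segments whose gaps are at most join_tolerance."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= join_tolerance:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def reconstruct(segments, prefilter_min, postfilter_min, join_tolerance=3):
    """Filter short segments, merge the rest, then filter short results."""
    candidates = [s for s in segments if length(s) >= prefilter_min]
    merged = merge_segments(candidates, join_tolerance)
    return [s for s in merged if length(s) >= postfilter_min]


# 50 dashes of length 0.94 forming one dotted line, plus an isolated speck.
dashes = [(10 + i * 1.5, 10 + i * 1.5 + 0.94) for i in range(50)]
stray = [(200.0, 200.94)]

# One shared threshold of 1: the dashes are dropped before merging.
shared_strict = reconstruct(dashes + stray, prefilter_min=1, postfilter_min=1)

# One shared threshold of 0.5: the line is rebuilt, but the speck survives too.
shared_loose = reconstruct(dashes + stray, prefilter_min=0.5, postfilter_min=0.5)

# Separate thresholds: keep the dashes for merging, then drop short leftovers.
separate = reconstruct(dashes + stray, prefilter_min=0.5, postfilter_min=3)
```

In this toy setup, only the run with separate thresholds yields the single merged line without the speck, which is the kind of case where one value for both stages is not enough.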

I'm thinking about all possible corner cases. If this is not a concern right now, I will go ahead and submit a pull request.

bronislav avatar Feb 13 '25 14:02 bronislav

Ah, I see what you mean. Yes, I think you're correct that it's a good idea to allow these to be set separately. Here's an attempt to implement that, now on develop: https://github.com/jsvine/pdfplumber/commit/42a004f6b245b6ac7d0c1504d6967cc625d66567

Do you happen to have a PDF you could share that would be useful for testing this setting?

jsvine avatar Jul 20 '25 12:07 jsvine

Unfortunately, the PDF where I encountered this problem is my bank statement, which I don't want to share for obvious reasons. However, I'll test this change on that document myself and provide feedback. If I'm able to modify that PDF or generate a similar one, I'll share it as well.

bronislav avatar Jul 20 '25 20:07 bronislav

I was able to test it on one of my bank statements that has this oddly formatted dotted line, and the new setting let me configure the table extractor properly.

bronislav avatar Oct 10 '25 05:10 bronislav