Allow configuring a minimal length for the edges that are considered for line reconstruction
Small edges (shorter than 1) are ignored before attempting to reconstruct a line from them: https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py#L668-L671
The filter_edge method has a parameter min_length that currently defaults to 1. I stumbled across a PDF file where a line is drawn as small dashes of length 0.94. Those dashes are filtered out, so the line is ignored when extracting the table.
At first, I thought of using the edge_min_length setting for this, but it seems this might not be backward compatible. Introducing an additional setting might be a better approach.
I am willing to submit a pull request, but first, I want to know if you are willing to accept this improvement.
Interesting edge-case, @bronislav. Thanks for sharing and flagging. I think passing edge_min_length to the filter_edge calls should work ... but perhaps I'm overlooking something?
It should work, but this setting filters out small edges after merging. Some cases require different thresholds for edge filtering before and after merging.
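To illustrate why the two thresholds may need to differ, here is a minimal, self-contained sketch of a filter, merge, filter pipeline. It is not pdfplumber's actual code: the helper names (filter_by_length, merge_horizontal, reconstruct), the simplified edge dicts, and the threshold values are all assumptions chosen for the example.

```python
# Hypothetical sketch: edges are plain dicts with x0, x1 and top keys.

def edge_length(edge):
    """Length of a horizontal edge."""
    return edge["x1"] - edge["x0"]

def filter_by_length(edges, min_length):
    """Keep only edges that are at least min_length long."""
    return [e for e in edges if edge_length(e) >= min_length]

def merge_horizontal(edges, join_tolerance):
    """Join edges on the same row whose gap is within join_tolerance."""
    merged = []
    for e in sorted(edges, key=lambda e: (e["top"], e["x0"])):
        last = merged[-1] if merged else None
        if (
            last is not None
            and last["top"] == e["top"]
            and e["x0"] - last["x1"] <= join_tolerance
        ):
            last["x1"] = max(last["x1"], e["x1"])
        else:
            merged.append(dict(e))
    return merged

def reconstruct(edges, prefilter_min, join_tolerance, postfilter_min):
    """Filter short edges, merge the rest, then filter again."""
    kept = filter_by_length(edges, prefilter_min)
    merged = merge_horizontal(kept, join_tolerance)
    return filter_by_length(merged, postfilter_min)

# A dashed line: ten dashes of length 0.94 separated by gaps of 0.5.
dashes = [
    {"x0": i * 1.44, "x1": i * 1.44 + 0.94, "top": 100.0}
    for i in range(10)
]

# With a pre-merge threshold of 1, every dash is dropped before merging.
lost = reconstruct(dashes, prefilter_min=1, join_tolerance=3, postfilter_min=3)

# With a lower pre-merge threshold, the dashes merge into one long edge
# that easily passes the stricter post-merge threshold.
found = reconstruct(dashes, prefilter_min=0.5, join_tolerance=3, postfilter_min=3)

print(len(lost), len(found))
```

In this sketch, lowering the post-merge threshold alone would not recover the line, because the dashes never reach the merge step. That is the case for keeping the two thresholds independent.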
I'm thinking about all possible corner cases. If this is not a concern right now, I will go ahead and submit a pull request.
Ah, I see what you mean. Yes, I think you're correct that it's a good idea to allow these to be set separately. Here's an attempt to implement that, now on develop: https://github.com/jsvine/pdfplumber/commit/42a004f6b245b6ac7d0c1504d6967cc625d66567
Do you happen to have a PDF you could share that would be useful for testing this setting?
Unfortunately, the PDF where I encountered this problem is my bank statement, which I don't want to share for obvious reasons. However, I'll test this change on that document myself and provide feedback. If I can modify that PDF or generate a similar one, I'll share it as well.
I was able to test it on one of my bank statements that has this weirdly formatted dotted line, and the new setting let me configure the table extractor properly.