pdfplumber
pdfplumber copied to clipboard
For text extraction, add fractional versions of `x/y_tolerance` arguments
Currently, x_tolerance
and y_tolerance
are treated as numeric constants. But, as @Sarke points on in https://github.com/jsvine/pdfplumber/issues/606#issuecomment-1703456276, it could be useful to provide a "fractional" version of these arguments:
[...] ideally we would be able to set the tolerance as a fraction of the font-size, since both the words spacing and the line spacing usually change proportionally with the change in font-size.
Implementing this correctly might be tricky, as x/y_tolerance
are passed across a few methods, but it should be doable. Some other things to sort out:
- Let's say the new x-related parameter is
x_tolerance_fraction
. Should the value be a number (indicating desired fractional threshold) or a boolean (indicating thatx_tolerance
should be interpreted as a fraction)? - What should the precise calculation be? Something like (in pseudocode)
x_tolerance = current_character["size"] * x_tolerance_fraction
?
Any other questions or complications I may be overlooking?
is anyone working on this? I'd like to take a crack at it if not!
@afriedman412 I'm not aware of anyone actively working on this, thanks for checking — and thanks for offering! Would be wonderful if you took a crack at it.
great -- do you have a pdf with extra-tight letters I can use for testing?
(the pdf in the original issue worked fine with x_toleraance=1
sooo....)
@afriedman412 How about something like this?: issue-987-test.pdf
import pdfplumber
pdf = pdfplumber.open("issue-987-test.pdf")
page = pdf.pages[0]
for x in [ 0, 3, 10 ]:
print(f"--- x_tolerance = {x} ---")
print(page.extract_text(x_tolerance=x))
print("")
... outputs this:
--- x_tolerance = 0 ---
Big Te xt
Small Te xt
--- x_tolerance = 3 ---
Big Te xt
Small Text
--- x_tolerance = 10 ---
Big Text
SmallText
sorry im confused -- what do we want it to output?
Ah, my apologies for not being more explicit. Ideally, the proportional tolerance feature would make it possible to get this back:
Big Text
Small Text
The examples above show (or try to show) that non-proportional tolerances either under-condense the big text or over-condense the small text.
is there a less dumb way to get text size than int(text['bottom'] - text['top'])
?
Are you looking at char
objects, or something else (which I might infer from the variable being named text
)? If char
objects:
-
char["height"]
already gets you that calculation (without the floor-ing thatint(...)
introduces) - ... but it might be more idiomatic/expected to use
char["size"]
as the base
I'd have to check more carefully, but I believe those two values should typically be the same.
Honestly I'm lazy and couldn't find an easy way to extract char
objects from text, so now it works with anything with a top
and bottom
param. The int
isn't necessary -- I guess I assumed font sizes are always whole numbers?
def get_char_tolerance(t, x_tolerance):
"""Scales x_tolerance to font size (height of text)"""
if "bottom" in t.keys() and "top" in t.keys():
return int(t['bottom'] - t['top'])*x_tolerance
else:
raise KeyError("Couldn't get height of text for x_tolerance scaling.")
Anyways, fractional x_tolerance
mostly works now! The default value is 0.15, because "issue-987-test.pdf"
rendered both text sizes correctly between 0.06 and 0.2, and it passed the unit test, and I like round numbers.
Some questions:
- Where are the table-extraction values set? Would it make sense to define all the text defaults in one place? (
DEFAULT_X_TOLERANCE
is currently just at the top oftext.py
.) - How flexible does this param need to be? I was thinking about scaling it so the default value is 1, but I'm not sure what the range would be
- I also kind of feel like making the number a decimal implies scaling, whereas a whole number implies total units
- That said, do we want to leave the option in for explicit tolerance?
Thanks, @afriedman412! I'll address your specific questions below, but first this seems like a good opportunity for me to sketch out a bit more about how I see this working:
- The current default settings will continue to apply — i.e., by default,
.extract_text(...)
and related methods will continue to use explicit (rather than proportional) tolerances. - Those methods will gain two new parameters:
x_tolerance_ratio
andy_tolerance_ratio
, which will both default toNone
. ("_ratio" seems a bit better name than my original proposal, "_fraction".) - If
x_tolerance_ratio
is set to a number (integer or float), it will override any explicitx_tolerance
(and likewisey_tolerance
ify_tolerance_ratio
is set). - When determining whether the horizontal distance between Character A and Character B exceeds the
x_tolerance_ratio
, we calculatex_tolerance
to beA["size"] * x_tolerance_ratio
(and something analogous for vertical distances).
As a matter of actual implementation, things get tricky, as these tolerances are used in several parts of pdfplumber
:
- Separating lines from one another, via
cluster_objects
:- For
layout=True
: https://github.com/jsvine/pdfplumber/blob/d9561d15ccc7858446a925b8040f7c69e0bdf5ec/pdfplumber/utils/text.py#L466-L468 - ... and: https://github.com/jsvine/pdfplumber/blob/d9561d15ccc7858446a925b8040f7c69e0bdf5ec/pdfplumber/utils/text.py#L227-L232
- For
layout=False
: https://github.com/jsvine/pdfplumber/blob/d9561d15ccc7858446a925b8040f7c69e0bdf5ec/pdfplumber/utils/text.py#L538
- For
- Separating words from one another in the same line: https://github.com/jsvine/pdfplumber/blob/d9561d15ccc7858446a925b8040f7c69e0bdf5ec/pdfplumber/utils/text.py#L347-L351
... and possibly a few other places, not to mention where these utility functions are integrated into the Page.extract_...
methods. This is not to discourage you from implementing this, but rather just a note that there are some tricky bits.
Now on to the specific questions:
Where are the table-extraction values set?
I think here you're asking about the default parameter values for the table-extraction methods? If so, you can find the table-specific ones at the top of this file: https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py
... while the general text extraction defaults are set at the top of this file: https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/utils/text.py
Would it make sense to define all the text defaults in one place? (DEFAULT_X_TOLERANCE is currently just at the top of text.py.)
I'd prefer to keep as-is.
How flexible does this param need to be? I was thinking about scaling it so the default value is 1, but I'm not sure what the range would be
See above, which I believe should answer that question, but let me know if not.
I also kind of feel like making the number a decimal implies scaling, whereas a whole number implies total units
I think we should give users the option to specify these ratios as any number, to give them as much precision as they'd like.
That said, do we want to leave the option in for explicit tolerance?
See above; I think the explicit tolerance should remain the default, both for backward-compatibility's sake and for predictability's sake (the explicit value does not depend on character order, which I believe the ratio version will).
Happy to clarify any of the above and to answer any follow-up Qs! And thanks again!
thanks for all this
my big question is do we want to make calculating tolerances dynamic?
like right now my approach is basically just using the size of first character available to calculate the tolerance. if we are iterating through the lines on a page with cluster_objects
we could theoretically adjust ratios on the fly...
my big question is do we want to make calculating tolerances dynamic? like right now my approach is basically just using the size of first character available to calculate the tolerance.
Good question. At the very least, we want the tolerances to be dynamic between lines.
I think the answer is less clear within lines. The argument in favor of calculating tolerances on a per-character (i.e., within line) basis would be more flexibility for lines containing text of different sizes. Not necessarily the most common occurrence, but a possibility. The argument against would be greater complexity and a (small, probably negligible) performance hit. I'd say let's start experimenting / prototyping without that, and then see how much of a hassle it'd be to add it.
are we sure we need y_tolerance_ratio
? off top it feels like line spacing is much less dependent on font size...
I'm going to implement x_tolerance first and we can go from there
I think that's a reasonable (and smartly constrained) place to start. I think y_tolerance_ratio
would still be helpful, but x_tolerance_ratio
would certainly still be useful on its own.