Custom dedupe_chars char properties
Following up on #71: I have run into duplicated characters that share the same text but differ in other properties (e.g., fonts). Unfortunately, I can't share the file, as it is private. I would like the deduplication function to optionally accept a custom list of properties to compare (please let me know if this is already possible some other way!).
Here's a sketch of the proposed code changes:
https://github.com/jsvine/pdfplumber/blob/147f2c4c07dc1191fc1d05bb589b4f6af3aaf74a/pdfplumber/utils/text.py#L789
def dedupe_chars(
    chars: T_obj_list,
    tolerance: T_num = 1,
    char_properties: Optional[List[str]] = None,
) -> T_obj_list:
    """
    Removes duplicate chars: those sharing the same text, fontname, size,
    and positioning (within `tolerance`) as other characters in the set.
    """
    # key = itemgetter("fontname", "size", "upright", "text")
    if char_properties is None:
        char_properties = ["fontname", "size", "upright", "text"]
    key = itemgetter(*char_properties)
    <... more code>
The interfaces exposing this should also be updated.
The end result would look like this:
print(section.dedupe_chars(tolerance=0.1, char_properties=['text']).extract_text())
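To illustrate the behaviour I have in mind without the private PDF, here is a small standalone sketch. It is not pdfplumber's implementation: `dedupe_chars_sketch`, the `Char` alias, and the sample data are hypothetical, and the position check is a simple pairwise comparison rather than whatever clustering the library does internally. It only assumes chars are dicts with `x0` and `doctop` keys plus the properties being compared.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any, Dict, List, Optional, Sequence

Char = Dict[str, Any]  # hypothetical stand-in for a pdfplumber char dict

DEFAULT_PROPERTIES = ("fontname", "size", "upright", "text")


def dedupe_chars_sketch(
    chars: List[Char],
    tolerance: float = 1,
    char_properties: Optional[Sequence[str]] = None,
) -> List[Char]:
    """Simplified illustration: keep the first char of each group of
    near-coincident chars that agree on every property in `char_properties`."""
    props = tuple(char_properties) if char_properties is not None else DEFAULT_PROPERTIES
    key = itemgetter(*props)

    kept: List[Char] = []
    # sorted() is stable, so chars keep their original relative order per group.
    for _, group in groupby(sorted(chars, key=key), key=key):
        group_kept: List[Char] = []
        for char in group:
            is_duplicate = any(
                abs(char["x0"] - other["x0"]) <= tolerance
                and abs(char["doctop"] - other["doctop"]) <= tolerance
                for other in group_kept
            )
            if not is_duplicate:
                group_kept.append(char)
        kept.extend(group_kept)

    # Restore the original ordering of the surviving chars.
    order = {id(c): i for i, c in enumerate(chars)}
    return sorted(kept, key=lambda c: order[id(c)])


# Made-up data: the same "A" drawn twice in two different fonts.
sample_chars = [
    {"text": "A", "fontname": "Helvetica", "size": 10, "upright": True, "x0": 10.0, "doctop": 20.0},
    {"text": "A", "fontname": "Helvetica-Bold", "size": 10, "upright": True, "x0": 10.05, "doctop": 20.0},
    {"text": "B", "fontname": "Helvetica", "size": 10, "upright": True, "x0": 16.0, "doctop": 20.0},
]

by_default = dedupe_chars_sketch(sample_chars, tolerance=0.1)
by_text = dedupe_chars_sketch(sample_chars, tolerance=0.1, char_properties=["text"])
print("".join(c["text"] for c in by_default))  # AAB
print("".join(c["text"] for c in by_text))     # AB
```

With the default properties, the two "A" chars are treated as distinct because their `fontname` differs; comparing on `text` alone collapses them, which is the case I am hitting.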
Thanks for the suggestion, @felix-hh. Are you able to share a version of the PDF redacted with https://github.com/JoshData/pdf-redactor? Or another PDF that demonstrates the same issue?
Hi @jsvine, I made a good-faith attempt at redacting the PDF with the tool, but the footer text is not redacted and can still be extracted. This is a problem because the footer identifies the data source, which is proprietary. I also do not know how to reproduce the issue with a PDF of my own.
Let me know if there is some other way I can help. I am happy to provide a pull request for the change and to verify that it works on my end.
Here are some screenshots, if they help:
Redacted PDF screenshot:
What the output of extract_text looks like:
Thanks @felix-hh. For new features, I like/want to have unit tests for them, which requires a PDF demonstrating a failing example. Could you use a tool (e.g., Adobe Acrobat, Preview, etc.) to manually redact the footer text?