Custom dedupe_chars char properties

Open felix-hh opened this issue 1 year ago • 4 comments

Following up on #71: I have run into duplicated characters that share the same text but differ in other properties (e.g., fonts). Unfortunately, I can't share the file because it is private. I am requesting an optional parameter for passing custom properties to the deduplication function (please let me know if this is already available!).

Here's a sketch of the proposed code changes:

https://github.com/jsvine/pdfplumber/blob/147f2c4c07dc1191fc1d05bb589b4f6af3aaf74a/pdfplumber/utils/text.py#L789

def dedupe_chars(
    chars: T_obj_list,
    tolerance: T_num = 1,
    char_properties: Optional[List[str]] = None,
) -> T_obj_list:
    """
    Removes duplicate chars: those sharing the same values for every key in
    `char_properties` (by default fontname, size, upright, and text) and the
    same positioning (within `tolerance`) as other characters in the set.
    """
    # Previously hard-coded:
    # key = itemgetter("fontname", "size", "upright", "text")
    if char_properties is None:
        char_properties = ["fontname", "size", "upright", "text"]
    key = itemgetter(*char_properties)

    <... more code>

The interfaces that expose this function should also be updated to accept the new parameter.

The end result would look like this:

print(section.dedupe_chars(tolerance=0.1, char_properties=['text']).extract_text())
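To make the intended behavior concrete without the private file, here is a minimal standalone sketch. It does not use pdfplumber itself: `dedupe_chars_sketch` is a hypothetical name, the chars are plain dicts with made-up values, and the pairwise position check is a simplification I am assuming for illustration rather than the library's actual clustering logic.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any, Dict, List, Optional, Sequence

Char = Dict[str, Any]
DEFAULT_PROPERTIES = ("fontname", "size", "upright", "text")


def dedupe_chars_sketch(
    chars: List[Char],
    tolerance: float = 1,
    char_properties: Optional[Sequence[str]] = None,
) -> List[Char]:
    """Illustrative only: drop chars that match an earlier char on every key
    in `char_properties` and sit within `tolerance` of it in doctop and x0."""
    props = list(DEFAULT_PROPERTIES if char_properties is None else char_properties)
    if not props:
        # itemgetter() needs at least one key
        raise ValueError("char_properties must name at least one key")
    key = itemgetter(*props)
    pos_key = itemgetter("doctop", "x0")

    kept: List[Char] = []
    # Group chars that agree on all of the chosen properties.
    for _, group in groupby(sorted(chars, key=key), key=key):
        group_kept: List[Char] = []
        for char in sorted(group, key=pos_key):
            # Simplified pairwise check (an assumption of this sketch).
            is_duplicate = any(
                abs(char["doctop"] - other["doctop"]) <= tolerance
                and abs(char["x0"] - other["x0"]) <= tolerance
                for other in group_kept
            )
            if not is_duplicate:
                group_kept.append(char)
        kept.extend(group_kept)

    # Return the survivors in their original order.
    original_index = {id(char): i for i, char in enumerate(chars)}
    return sorted(kept, key=lambda char: original_index[id(char)])


# Made-up data: the same "A" drawn twice at nearly the same spot in two fonts.
chars = [
    {"text": "A", "fontname": "Helvetica", "size": 10, "upright": True,
     "doctop": 100.0, "x0": 50.0},
    {"text": "A", "fontname": "Helvetica-Bold", "size": 10, "upright": True,
     "doctop": 100.0, "x0": 50.02},
    {"text": "B", "fontname": "Helvetica", "size": 10, "upright": True,
     "doctop": 100.0, "x0": 57.0},
]

default_result = dedupe_chars_sketch(chars, tolerance=0.1)
text_only_result = dedupe_chars_sketch(chars, tolerance=0.1, char_properties=["text"])

print("".join(c["text"] for c in default_result))    # AAB
print("".join(c["text"] for c in text_only_result))  # AB
```

With the default properties the two "A" chars land in different groups because their fonts differ, so both survive. Grouping on `text` alone puts them in the same group, and the second one is dropped as a positional duplicate.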

felix-hh, Mar 14 '24 00:03

Thanks for the suggestion, @felix-hh. Are you able to share a version of the PDF redacted with https://github.com/JoshData/pdf-redactor? Or another PDF that demonstrates the same issue?

jsvine, Mar 15 '24 20:03

Hi @jsvine, I made a good-faith attempt at redacting the PDF with the tool, but the footer text is not redacted and can still be extracted. This is a problem because the footer identifies the data source, which is proprietary. I also do not know how to reproduce the issue with a PDF of my own.

felix-hh, Mar 16 '24 21:03

Let me know if there is some other way I can help. I am happy to provide a pull request for the change and verify that it works on my end.

Here are some screenshots, in case they help:

Redacted PDF screenshot: [image]

What the output of extract_text looks like: [image]

felix-hh, Mar 16 '24 21:03

Thanks @felix-hh. For new features, I like/want to have unit tests for them, which requires a PDF demonstrating a failing example. Could you use a tool (e.g., Adobe Acrobat, Preview, etc.) to manually redact the footer text?

jsvine, Mar 25 '24 15:03