
PDFTextStripper.getText misses some characters from a PDF file

Open jamiehiggins opened this issue 2 years ago • 2 comments

When attempting to extract text from the attached simple PDF file there are some characters missing within the text.

To reproduce the problem, simply call pdfStripper.getText() on the attached PDF file (Problematic.pdf).

The text is mostly returned ok, however the following issues are present in the returned text:

making time to reflect and review your -> making time to reect and review your
If you find it easier -> If you nd it easier

PdfBox-Android version: [e.g. 2.0.27.0]
It happens on all versions of the Android SDK (I have tried several).

Problematic.pdf

jamiehiggins avatar Jun 27 '23 12:06 jamiehiggins

This is an unsolved problem: https://issues.apache.org/jira/browse/PDFBOX-3248

In this file, the /ToUnicode CMap maps the ligatures to 0, and the content stream uses the /ActualText feature, which PDFBox doesn't support.
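To illustrate why the "fl" and "fi" ligatures vanish, here is a minimal, self-contained simulation. It is not PDFBox code: the `LigatureDemo` class, the `Glyph` type, and the glyph codes 1 and 2 are all hypothetical, and treating a mapping to 0 as "no usable text" is an assumption about the observed behaviour rather than a description of the library's internals. It only shows that an extractor ignoring /ActualText has nothing to emit for such glyphs, while one that honours it can recover the text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simulation only: none of these names exist in PDFBox or PdfBox-Android.
class LigatureDemo {
    static final int FL = 1; // hypothetical glyph code for the "fl" ligature
    static final int FI = 2; // hypothetical glyph code for the "fi" ligature
    static final String NUL = String.valueOf((char) 0);

    // One shown glyph: its code plus an optional /ActualText replacement (null if absent).
    static final class Glyph {
        final int code;
        final String actualText;

        Glyph(int code, String actualText) {
            this.code = code;
            this.actualText = actualText;
        }
    }

    // Stand-in for the /ToUnicode CMap: letters map to themselves, ligatures map to 0.
    static Map<Integer, String> toUnicode() {
        Map<Integer, String> map = new HashMap<>();
        for (char c = 'a'; c <= 'z'; c++) {
            map.put((int) c, String.valueOf(c));
        }
        map.put(FL, NUL);
        map.put(FI, NUL);
        return map;
    }

    // Builds the glyph run for a word with one ligature in the middle.
    static List<Glyph> glyphs(String before, int ligature, String actualText, String after) {
        List<Glyph> out = new ArrayList<>();
        for (char c : before.toCharArray()) {
            out.add(new Glyph(c, null));
        }
        out.add(new Glyph(ligature, actualText));
        for (char c : after.toCharArray()) {
            out.add(new Glyph(c, null));
        }
        return out;
    }

    // Extracts text; /ActualText is consulted only when honorActualText is true.
    static String extract(List<Glyph> run, boolean honorActualText) {
        Map<Integer, String> map = toUnicode();
        StringBuilder sb = new StringBuilder();
        for (Glyph g : run) {
            if (honorActualText && g.actualText != null) {
                sb.append(g.actualText);
                continue;
            }
            String text = map.get(g.code);
            if (text == null || text.equals(NUL)) {
                continue; // assumption: a mapping to 0 yields no usable character
            }
            sb.append(text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> reflect = glyphs("re", FL, "fl", "ect");
        System.out.println(extract(reflect, false)); // prints "reect"
        System.out.println(extract(reflect, true));  // prints "reflect"
    }
}
```

The same run for "find" (`glyphs("", FI, "fi", "nd")`) gives "nd" without /ActualText and "find" with it, matching the symptoms in the report.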

THausherr avatar Jul 29 '23 13:07 THausherr

Possible solution that works with the linked file: https://issues.apache.org/jira/browse/PDFBOX-5868?focusedCommentId=17874189&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17874189

THausherr avatar Aug 16 '24 12:08 THausherr