
getText() may return Non-ASCII/UTF-8 characters

Open JamoCA opened this issue 2 years ago • 4 comments

PDFBox wasn't correctly parsing some italic yellow-on-brown text (not my PDF). It returned INFORMATION SEPARATOR ONE (U+001F) in place of "Th" and INFORMATION SEPARATOR TWO (U+001E) in place of "fi".
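To confirm which unexpected characters a given extraction contains, a small diagnostic along these lines may help. This is only a sketch: the class name `ControlCharReport` and the method `find` are hypothetical and not part of pdfbox.cfc or PDFBox. It treats tab, line feed, and carriage return as legitimate and reports every other control code point it encounters.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of PDFBox or pdfbox.cfc.
class ControlCharReport {
    // Returns the distinct control code points found in the text,
    // formatted like "U+001F", in order of first appearance.
    // Tab, LF, and CR are assumed to be legitimate and are skipped.
    static Set<String> find(String text) {
        Set<String> found = new LinkedHashSet<>();
        text.codePoints()
            .filter(cp -> Character.isISOControl(cp)
                    && cp != '\t' && cp != '\n' && cp != '\r')
            .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }
}
```

Running it over the extracted text before any cleanup shows whether the problem is limited to the two separators reported here or involves other code points as well.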

I normally use the Java junidecode library to convert UTF-8 text to 7-bit ASCII, but it didn't handle these characters. I'd rather not have stray control characters in the text, so I used the following regex to strip everything outside the printable ASCII range (\x20-\x7E), which removes control characters as well as high ASCII. (I figured the text was already wrong and it'd be better to omit these characters than to retain them.)

text = rereplace(text, "[^\x20-\x7E]", "", "all");
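One thing to keep in mind: the character class above also matches tabs and line breaks, so they are removed too. If the line structure of the extracted text matters, a variant that keeps them could look like the following Java sketch. The class and method names are hypothetical, and like the original regex it still drops every legitimate non-ASCII character (accented letters, curly quotes, and so on).

```java
// Hypothetical helper, not part of PDFBox or pdfbox.cfc.
class PrintableFilter {
    // Removes every character outside printable ASCII (0x20-0x7E),
    // except tab, line feed, and carriage return, which are kept.
    static String stripNonPrintable(String text) {
        return text.replaceAll("[^\\x20-\\x7E\\t\\n\\r]", "");
    }
}
```

The same character class should carry over to the CFML call by adding `\t\n\r` inside the brackets, though that is untested here.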

Have you come across this issue before?

JamoCA avatar Jun 15 '23 00:06 JamoCA

I'm comparing the PDFBox 2.0.27 results against third-party services to see what they are capable of.

The Minion Pro text in the PDF (viewed in Foxit PDF Reader) appears in italics as: The Definitive Expert in Carmel ... but selecting and copying it returns the following when pasted into VS Code: e De native Expert in Carmel

NOTE: It's possible that these characters are font-specific ligatures.
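If they are ligatures, an alternative to deleting the characters is to map them back to the letters they stand for. The sketch below assumes the mapping reported in this issue (U+001F for "Th", U+001E for "fi"), which is specific to this one PDF and font; other documents may use different code points, so treat the table as an assumption. The class name `LigatureRepair` is hypothetical. It also applies NFKC normalization, which expands the standard Unicode ligatures such as U+FB01 into plain letters; note that NFKC changes other compatibility characters too.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox or pdfbox.cfc.
class LigatureRepair {
    // Replaces the two control characters seen in this issue with the
    // ligatures they appear to represent (an assumption based on one PDF),
    // then expands standard Unicode ligatures via NFKC normalization.
    static String repair(String text) {
        String mapped = text
            .replace("\u001f", "Th")
            .replace("\u001e", "fi");
        return Normalizer.normalize(mapped, Normalizer.Form.NFKC);
    }
}
```

Applied before any stripping step, this would keep the words intact instead of leaving gaps where the ligatures were.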

PDF2go correctly identified the text (using an OCR method) without munging any characters: The Definitive Expert in Carmel

PDFCandy returned odd spacing: Th e Defi native Expert in Carmel

PDFForge worked: The Definitive Expert in Carmel

PDFtk has multiple options, but also failed: e De native Expert in Carmel

JamoCA avatar Jun 15 '23 21:06 JamoCA

Interesting - I haven't encountered this before.

I'm planning on upgrading the jar to 2.0.28 soon - wondering if that will make any difference.

If it doesn't, maybe it makes sense to include an option in getText() to strip high ASCII characters.

mjclemente avatar Jun 15 '23 21:06 mjclemente

I should have asked this earlier - do you have an example PDF that shows the issue, for testing?

mjclemente avatar Jan 04 '24 16:01 mjclemente

I visited your blog and sent a private email with a link to the PDF in which I first encountered these issues.

JamoCA avatar Jan 04 '24 17:01 JamoCA