Text extractor with UTF-8
Hi All,
I am trying to read a PDF that contains Nordic characters (äöå), but I can't get them out of the text extractor; they are just replaced with spaces.
Here's the code:
byte[] decoded = Base64.getDecoder().decode(<Base64 String representation of PDF>);
ByteArrayInputStream bais = new ByteArrayInputStream(decoded);
PdfReader reader = new PdfReader(bais);
PdfTextExtractor extractor = new PdfTextExtractor(reader);
String result = extractor.getTextFromPage(1);
The first few characters of the result are "T m ", but I was expecting "Tämä".
I've seen a few issues regarding whitespace but not special characters.
Any help would be very gratefully received.
Thanks
Matt
Please share an example PDF for which that happens.
I'm working on an example without any customer data that I can post here. In the meantime, I have tried to examine as much as I can to understand where the problem is. I have tried getOriginalChars and getOriginalBytes on the PDF string produced by the text extractor, and I can see that at that point the characters have already been remapped to spaces.
E.g., here's the printout of the bytes and characters taken directly from the extracted string: 84:T 32: 109:m 32:
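For reference, a printout like that can be produced with plain Java, no PDF library needed. This is only a minimal sketch; the class and method names (CharDump, dump) are made up for illustration. Code 32 is a space, whereas a correctly extracted 'ä' would show up as 228.

```java
// Hypothetical helper (not part of OpenPDF): dumps each UTF-16 code unit
// of a string as "<decimal code>:<char>", separated by single spaces.
final class CharDump {
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (i > 0) {
                sb.append(' ');
            }
            sb.append((int) c).append(':').append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The text as reported by the extractor: spaces where 'ä' should be.
        System.out.println(dump("T m "));
        // The expected text "Tämä" (\u00e4 is 'ä'); its 'ä' has code 228.
        System.out.println(dump("T\u00e4m\u00e4"));
    }
}
```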
My question is, is it the fonts that matter? Do I need to load the fonts into the factory repository in order for them to be used when the text extractor reads the data? Or am I able to set what font should be used when reading?
Any help would be amazing,
Thanks
Matt
It can be the font, but it can also be other things. What happens if you copy the text out of Adobe Reader? Is it correct? Please also post an example PDF here so I can try it with my local version: I remember having problems in the past with UTF-16 BOM encoded PDFs, special characters, and text extraction. If that is the problem I may be able to provide a patch ...
Finally managed to produce a test file!
I'm wondering if it's the fonts, which are Minion Pro (an Adobe special font that I don't have loaded?) or E X_ CF F_ Arial (no idea what that is; is it part of the Arial family, or do I need something special?).
Many thanks
Matt
Ok, I had a look into that file. The PDF font that Tämä is drawn with is an embedded Type 1 font that has no ToUnicode map and also no Encoding entry. Thus, one has to assume a built-in encoding and use that. OpenPDF uses StandardEncoding in this case. It matches somewhat but has many gaps, and in particular it has no code representing an 'ä'.
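To illustrate the effect, here is a minimal sketch in plain Java, not OpenPDF's actual code. It assumes the font draws 'ä' with code 0xE4 (I have not listed the real codes here, so that value is hypothetical), uses a tiny stand-in table covering only the basic letters, and assumes unmapped codes come out as spaces, which matches the reported output.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class: decodes character codes through a table with gaps.
final class GappyEncodingDemo {
    // Stand-in for a code-to-character table. Only A-Z and a-z are filled in;
    // like StandardEncoding, it has no entry that yields 'ä'.
    static final Map<Integer, Character> TABLE = new HashMap<>();
    static {
        for (int code = 'A'; code <= 'Z'; code++) TABLE.put(code, (char) code);
        for (int code = 'a'; code <= 'z'; code++) TABLE.put(code, (char) code);
    }

    // Assumed codes for "Tämä": 0xE4 for 'ä' is an illustration only.
    static final byte[] SAMPLE = {0x54, (byte) 0xE4, 0x6D, (byte) 0xE4};

    // Maps each code through the table; unmapped codes become a space
    // (an assumption that mirrors the observed result).
    static String decode(byte[] codes, Map<Integer, Character> table) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            Character ch = table.get(b & 0xFF);
            sb.append(ch != null ? ch : ' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "[T m ]": the gaps turn into spaces
        System.out.println("[" + decode(SAMPLE, TABLE) + "]");

        // With the missing entry added, the same codes decode to "Tämä".
        Map<Integer, Character> fixed = new HashMap<>(TABLE);
        fixed.put(0xE4, '\u00e4');
        System.out.println("[" + decode(SAMPLE, fixed) + "]");
    }
}
```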
One might try and retrieve the built-in encoding from the embedded font program to improve this behavior.
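A rough sketch of what that could look like, in plain Java and under several assumptions: the clear-text header of the Type 1 font program is already available as a string, its encoding is spelled out as "dup <code> /<glyphname> put" entries (a font may instead just say "/Encoding StandardEncoding def"), and glyph names are resolved through a table such as the Adobe Glyph List, of which only three entries are included here. The sample header is invented for illustration and is not taken from the attached file; all class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: read the built-in encoding of a Type 1 font program.
final class Type1EncodingSketch {
    // Matches entries such as "dup 228 /adieresis put".
    private static final Pattern ENTRY =
            Pattern.compile("dup\\s+(\\d+)\\s*/([^\\s/]+)\\s+put");

    // Invented sample of a Type 1 clear-text encoding section.
    static final String SAMPLE_HEADER =
            "/Encoding 256 array\n"
            + "0 1 255 {1 index exch /.notdef put} for\n"
            + "dup 84 /T put\n"
            + "dup 109 /m put\n"
            + "dup 228 /adieresis put\n"
            + "readonly def\n";

    // Tiny subset of a glyph-name-to-Unicode table (adieresis is U+00E4).
    static final Map<String, Character> GLYPH_TO_CHAR =
            Map.of("T", 'T', "m", 'm', "adieresis", '\u00e4');

    // Collects code -> glyph name pairs from the clear-text header.
    static Map<Integer, String> parseEncoding(String cleartext) {
        Map<Integer, String> codeToGlyph = new TreeMap<>();
        Matcher m = ENTRY.matcher(cleartext);
        while (m.find()) {
            codeToGlyph.put(Integer.parseInt(m.group(1)), m.group(2));
        }
        return codeToGlyph;
    }

    // Decodes codes via the parsed encoding; unknown codes become spaces.
    static String decode(byte[] codes, Map<Integer, String> codeToGlyph) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            String glyph = codeToGlyph.get(b & 0xFF);
            Character ch = glyph == null ? null : GLYPH_TO_CHAR.get(glyph);
            sb.append(ch != null ? ch : ' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> encoding = parseEncoding(SAMPLE_HEADER);
        byte[] codes = {84, (byte) 228, 109, (byte) 228};
        System.out.println(decode(codes, encoding)); // prints "Tämä"
    }
}
```

A real implementation would also have to locate and read the embedded font stream and fall back to StandardEncoding for fonts that do not define their own array.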
I was wondering why it is working for me, but then I realized that I am using PDFBox for text extraction 😂