PdfBox-Android
PDFTextStripperByArea failed extracting text (font problems?)
Hello,
I have tried the library with a lot of documents. It works fine, but with some of them I always get a message similar to this:
11-06 14:59:34.737: E/PDResources(23440): at java.lang.Thread.run(Thread.java:841)
11-06 14:59:34.737: W/PDTrueTypeFont(23440): Using fallback font for ArialMT
11-06 14:59:34.737: E/ExternalFonts(23440): No TTF fallback font for 'Helvetica'
11-06 14:59:34.737: W/PDTrueTypeFont(23440): Using fallback font for Tahoma
11-06 14:59:34.747: E/ExternalFonts(23440): No TTF fallback font for 'Helvetica'
11-06 14:59:34.747: W/PDTrueTypeFont(23440): Using fallback font for ArialNarrow
11-06 14:59:34.747: E/ExternalFonts(23440): No TTF fallback font for 'Helvetica'
11-06 14:59:34.747: W/PDTrueTypeFont(23440): Using fallback font for ArialNarrow-BoldItalic
11-06 14:59:34.747: E/ExternalFonts(23440): No TTF fallback font for 'Helvetica-BoldOblique'
11-06 14:59:34.747: W/System.err(23440): java.io.IOException: Error: Could not find referenced cmap stream Identity-H
11-06 14:59:34.747: W/System.err(23440): at org.apache.fontbox.cmap.CMapParser.getExternalCMap(CMapParser.java:383)
11-06 14:59:34.747: W/System.err(23440): at org.apache.fontbox.cmap.CMapParser.parsePredefined(CMapParser.java:84)
11-06 14:59:34.747: W/System.err(23440): at org.apache.pdfbox.pdmodel.font.CMapManager.getPredefinedCMap(CmapManager.java:34)
11-06 14:59:34.747: W/System.err(23440): at org.apache.pdfbox.pdmodel.font.PDType0Font.readEncoding(PDType0Font.java:71)
11-06 14:59:34.747: W/System.err(23440): at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:48)
11-06 14:59:34.747: W/System.err(23440): at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:73)
11-06 14:59:34.747: W/System.err(23440): at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:172)
11-06 14:59:34.747: W/System.err(23440): at org.apache.pdfbox.contentstream.PDFStreamEngine.getFonts(PDFStreamEngine.java:503)
11-06 14:59:34.747: W/System.err(23440): at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:32)
11-06 14:59:34.747: W/System.err(23440): at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:466)
11-06 14:59:34.747: W/System.err(23440): at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:220)
11-06 14:59:34.747: W/System.err(23440): at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:185)
11-06 14:59:34.747: W/System.err(23440): at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:139)
11-06 14:59:34.747: W/System.err(23440): at org.apache.pdfbox.contentstream.PDFTextStreamEngine.processStream(PDFTextStreamEngine.java:105)
11-06 14:59:34.747: W/System.err(23440): at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:353)
11-06 14:59:34.747: W/System.err(23440): at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:102)
Sometimes I get this message, and sometimes "No fallback font for 'Helvetica'" or the same message for another font.
Is this a problem with this library, or maybe with PDFBox?
The code I have used is this:
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
[...]
stripper.addRegion(k, regions_hash.get(k));
[...]
regions = stripper.getRegions();
for (String region : regions) {
    String textForRegion = stripper.getTextForRegion(region);
    textForRegion = textForRegion.trim();
    if (!textForRegion.isEmpty()) {
        outStream.write((textForRegion + ' ').getBytes());
    }
}
Already working on a fix. It seems to be a bug in the font system rather than something specific to text extraction or any other action.
Yesterday I was desperate, so I downloaded the code to see if I could do something to make it work. I changed a couple of lines and the "Could not find referenced cmap stream Identity-H" error doesn't appear anymore.
I have changed this in CMapParser.java:
/**
 * Returns an input stream containing the given "use" CMap.
 */
protected InputStream getExternalCMap(String name) throws IOException
{
    String path;
    URL url = getClass().getResource(name);
    //XXX Code changed
    if (url == null) {
        path = "/org/apache/pdfbox/resources/cmap/";
        url = getClass().getResource(path + name);
    }
    if (url == null)
    {
        throw new IOException("Error: Could not find referenced cmap stream " + name);
    }
    return url.openStream();
}
It seems that the library can't find the resources folder in the project.
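The two-step lookup in that change (try the bare name, then retry under the resources folder) can be sketched on its own. This is only an illustration: `CMapLookupSketch` and `resolve` are hypothetical names, not part of PDFBox, and the lookup function is passed in so the fallback order can be exercised without the real resources.

```java
import java.io.IOException;
import java.util.function.Function;

// Hypothetical helper, not part of PDFBox: isolates the "bare name first,
// then resources folder" lookup order used in the patched getExternalCMap.
final class CMapLookupSketch {
    // Fallback directory taken from the change above.
    static final String FALLBACK_DIR = "/org/apache/pdfbox/resources/cmap/";

    // Returns the first non-null result of lookup(name) or
    // lookup(FALLBACK_DIR + name); throws if both lookups return null.
    static <T> T resolve(String name, Function<String, T> lookup) throws IOException {
        T found = lookup.apply(name);
        if (found == null) {
            found = lookup.apply(FALLBACK_DIR + name);
        }
        if (found == null) {
            throw new IOException("Error: Could not find referenced cmap stream " + name);
        }
        return found;
    }
}
```

With a real classpath this could be called as `CMapLookupSketch.resolve("Identity-H", SomeClass.class::getResource)`, where `SomeClass` is whichever class sits next to the resources.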
I'm investigating the fallback font problem too. It's related to the system fonts that Android ships by default (some of the fonts defined in the code don't exist on Android).
Does your code work without fixing the fonts issue? It's popped up with other code before, but didn't impact the functionality at all.
I think it works; I don't get the "Could not find referenced cmap stream Identity-H" error now. My app recognizes most of the documents with this fix.
Are there still documents that don't work? Would you mind sharing an example pdf if you can?
I also have problems with this document: https://drive.google.com/file/d/0B-t3Zj2dsa4AZng3c2t5d05CV0E/view?usp=sharing
The error is:
11-07 23:38:21.405: W/System.err(18854): java.lang.IllegalStateException: No fonts available on the system for Helvetica
11-07 23:38:21.405: W/System.err(18854): at org.apache.pdfbox.pdmodel.font.ExternalFonts.getType1FallbackFont(ExternalFonts.java:256)
11-07 23:38:21.405: W/System.err(18854): at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:190)
11-07 23:38:21.405: W/System.err(18854): at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:49)
11-07 23:38:21.405: W/System.err(18854): at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:172)
11-07 23:38:21.405: W/System.err(18854): at org.apache.pdfbox.contentstream.PDFStreamEngine.getFonts(PDFStreamEngine.java:503)
11-07 23:38:21.405: W/System.err(18854): at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:32)
11-07 23:38:21.405: W/System.err(18854): at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:466)
11-07 23:38:21.405: W/System.err(18854): at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:220)
11-07 23:38:21.405: W/System.err(18854): at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:185)
11-07 23:38:21.405: W/System.err(18854): at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:139)
11-07 23:38:21.405: W/System.err(18854): at org.apache.pdfbox.contentstream.PDFTextStreamEngine.processStream(PDFTextStreamEngine.java:105)
11-07 23:38:21.405: W/System.err(18854): at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:373)
11-07 23:38:21.405: W/System.err(18854): at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:102)
I have made a lot of changes and fixed the problems with the last file. Here are all the changes:
First, I noticed that every system has its own font directories, so I defined a new finder for Android:
org.apache.fontbox.util.autodetect FontFileFinder.java:57
// Compare with equals(): == on strings compares references, not contents.
if ("The Android Project".equals(System.getProperty("java.vendor")))
    return new AndroidFontDirFinder();
else
    return new UnixFontDirFinder();
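The vendor check can also be pulled into a small helper. This is a sketch under assumptions: `PlatformCheckSketch` and its methods are hypothetical names, and it relies on `java.vendor` reporting "The Android Project" as in the snippet above. It compares with `equals()`, since `==` on strings compares object references and is not guaranteed to match a value returned by `System.getProperty`.

```java
// Hypothetical helper, not part of FontBox: decides whether the runtime
// looks like Android based on the java.vendor system property.
final class PlatformCheckSketch {
    // Vendor string taken from the snippet above (assumed, not verified on every device).
    static final String ANDROID_VENDOR = "The Android Project";

    // Null-safe content comparison; a null vendor yields false.
    static boolean isAndroidVendor(String vendor) {
        return ANDROID_VENDOR.equals(vendor);
    }

    // Convenience wrapper that reads the property from the running JVM.
    static boolean isAndroid() {
        return isAndroidVendor(System.getProperty("java.vendor"));
    }
}
```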
New finder class (based on UnixFontDirFinder)
org.apache.fontbox.util.autodetect AndroidFontDirFinder.java
package org.apache.fontbox.util.autodetect;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
public class AndroidFontDirFinder extends NativeFontDirFinder
{
    /**
     * @return a list of possible font locations
     */
    protected String[] getSearchableDirectories()
    {
        return new String[] {
            "/system/fonts" // system fonts
        };
    }
    /**
     * {@inheritDoc}
     */
    public Map<String, String> getCommonTTFMapping()
    {
        HashMap<String,String> map = new HashMap<String,String>();
        map.put("TimesNewRoman,BoldItalic","DroidSerif-BoldItalic");
        map.put("TimesNewRoman,Bold","DroidSerif-Bold");
        map.put("TimesNewRoman,Italic","DroidSerif-Italic");
        map.put("TimesNewRoman","DroidSerif-Regular");
        map.put("Arial,BoldItalic","Roboto-BoldItalic");
        map.put("Arial,Italic","Roboto-Italic");
        map.put("Arial,Bold","Roboto-Bold");
        map.put("Arial","Roboto-Regular");
        map.put("Courier,BoldItalic","DroidSansMono");
        map.put("Courier,Italic","DroidSansMono");
        map.put("Courier,Bold","DroidSansMono");
        map.put("Courier","DroidSansMono");
        map.put("Symbol", "OpenSymbol");
        map.put("ZapfDingbats", "Dingbats");
        return Collections.unmodifiableMap(map);
    }
}
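Since the mapping only helps if the target fonts are actually present, it may be worth listing what a given device has in the font directory. The following is a rough sketch: `FontDirListingSketch` and `listTtfNames` are hypothetical names, and it reports file names without the `.ttf` extension, which are not guaranteed to equal the fonts' PostScript names.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of FontBox: lists the .ttf files found
// directly inside a directory such as /system/fonts.
final class FontDirListingSketch {
    // Returns the sorted base names of all .ttf files in dir, or an empty
    // list if the directory does not exist or cannot be read.
    static List<String> listTtfNames(File dir) {
        List<String> names = new ArrayList<>();
        File[] files = dir.listFiles(); // null when dir is missing or unreadable
        if (files == null) {
            return names;
        }
        for (File f : files) {
            String n = f.getName();
            if (f.isFile() && n.toLowerCase(Locale.ROOT).endsWith(".ttf")) {
                names.add(n.substring(0, n.length() - 4)); // drop ".ttf"
            }
        }
        Collections.sort(names);
        return names;
    }
}
```

On a device this would be called as `FontDirListingSketch.listTtfNames(new File("/system/fonts"))` and the result compared against the values used in `getCommonTTFMapping()`.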
Last but not least, I have added the new Android fonts to the substitutes, based on https://github.com/android/platform_frameworks_base/blob/master/data/fonts/system_fonts.xml
org.apache.pdfbox.pdmodel.font ExternalFonts.java:103
/** Map of PostScript name substitutes, in priority order. */
private final static Map<String, List<String>> substitutes = new HashMap<String, List<String>>();
static
{
    //XXX Add Android Font substitutes
    // substitutes for standard 14 fonts
    substitutes.put("Courier",
        Arrays.asList("CourierNew", "CourierNewPSMT", "LiberationMono", "NimbusMonL-Regu","DroidSansMono"));
    substitutes.put("Courier-Bold",
        Arrays.asList("CourierNewPS-BoldMT", "CourierNew-Bold", "LiberationMono-Bold",
            "NimbusMonL-Bold","DroidSansMono"));
    substitutes.put("Courier-Oblique",
        Arrays.asList("CourierNewPS-ItalicMT","CourierNew-Italic",
            "LiberationMono-Italic", "NimbusMonL-ReguObli","DroidSansMono"));
    substitutes.put("Courier-BoldOblique",
        Arrays.asList("CourierNewPS-BoldItalicMT","CourierNew-BoldItalic",
            "LiberationMono-BoldItalic", "NimbusMonL-BoldObli","DroidSansMono"));
    substitutes.put("Helvetica",
        Arrays.asList("ArialMT", "Arial", "LiberationSans", "NimbusSanL-Regu","Roboto-Regular"));
    substitutes.put("Helvetica-Bold",
        Arrays.asList("Arial-BoldMT", "Arial-Bold", "LiberationSans-Bold",
            "NimbusSanL-Bold","Roboto-Bold"));
    substitutes.put("Helvetica-Oblique",
        Arrays.asList("Arial-ItalicMT", "Arial-Italic", "Helvetica-Italic",
            "LiberationSans-Italic", "NimbusSanL-ReguItal", "Roboto-Italic"));
    substitutes.put("Helvetica-BoldOblique",
        Arrays.asList("Arial-BoldItalicMT", "Helvetica-BoldItalic",
            "LiberationSans-BoldItalic", "NimbusSanL-BoldItal","Roboto-BoldItalic"));
    substitutes.put("Times-Roman",
        Arrays.asList("TimesNewRomanPSMT", "TimesNewRoman", "TimesNewRomanPS",
            "LiberationSerif", "NimbusRomNo9L-Regu","DroidSerif-Regular"));
    substitutes.put("Times-Bold",
        Arrays.asList("TimesNewRomanPS-BoldMT", "TimesNewRomanPS-Bold",
            "TimesNewRoman-Bold", "LiberationSerif-Bold",
            "NimbusRomNo9L-Medi", "DroidSerif-Bold"));
    substitutes.put("Times-Italic",
        Arrays.asList("TimesNewRomanPS-ItalicMT", "TimesNewRomanPS-Italic",
            "TimesNewRoman-Italic", "LiberationSerif-Italic",
            "NimbusRomNo9L-ReguItal","DroidSerif-Italic"));
    substitutes.put("Times-BoldItalic",
        Arrays.asList("TimesNewRomanPS-BoldItalicMT", "TimesNewRomanPS-BoldItalic",
            "TimesNewRoman-BoldItalic", "LiberationSerif-BoldItalic",
            "NimbusRomNo9L-MediItal","DroidSerif-BoldItalic"));
    substitutes.put("Symbol", Arrays.asList("SymbolMT", "StandardSymL"));
    substitutes.put("ZapfDingbats", Arrays.asList("ZapfDingbatsITC", "Dingbats"));
    // extra substitute mechanism for CJK CIDFonts when all we know is the ROS
    substitutes.put("$Adobe-CNS1", Arrays.asList("AdobeMingStd-Light"));
    substitutes.put("$Adobe-Japan1", Arrays.asList("KozMinPr6N-Regular"));
    substitutes.put("$Adobe-Korea1", Arrays.asList("AdobeGothicStd-Bold"));
    substitutes.put("$Adobe-GB1", Arrays.asList("AdobeHeitiStd-Regular"));
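To illustrate what "in priority order" means for such a map, here is a minimal, self-contained sketch of picking the first substitute that is actually installed. It is not the real `ExternalFonts` code: `SubstituteLookupSketch` and `firstAvailable` are hypothetical names, the set of installed fonts is passed in by the caller, and only two entries from the table above are copied.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not the actual ExternalFonts implementation.
final class SubstituteLookupSketch {
    // Two entries copied from the substitutes table above, in priority order.
    private static final Map<String, List<String>> SUBSTITUTES =
            new HashMap<String, List<String>>();
    static {
        SUBSTITUTES.put("Helvetica",
                Arrays.asList("ArialMT", "Arial", "LiberationSans", "NimbusSanL-Regu", "Roboto-Regular"));
        SUBSTITUTES.put("Times-Roman",
                Arrays.asList("TimesNewRomanPSMT", "TimesNewRoman", "TimesNewRomanPS",
                        "LiberationSerif", "NimbusRomNo9L-Regu", "DroidSerif-Regular"));
    }

    // Returns the requested font if installed, otherwise the first installed
    // substitute in list order, otherwise null.
    static String firstAvailable(String postScriptName, Set<String> installed) {
        if (installed.contains(postScriptName)) {
            return postScriptName;
        }
        List<String> candidates = SUBSTITUTES.get(postScriptName);
        if (candidates == null) {
            candidates = Collections.emptyList();
        }
        for (String candidate : candidates) {
            if (installed.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

Because the Android names are appended at the end of each list, a desktop system with Arial installed would still pick Arial, while a device that only has Roboto falls through to `Roboto-Regular`.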
There are still errors with other files. Here are two documents that show different kinds of errors: https://drive.google.com/file/d/0B-t3Zj2dsa4ATUw0NGhWWWtJRGc/view https://drive.google.com/file/d/0B-t3Zj2dsa4ATVRYTFdqNnZfbjQ/view?usp=sharing
In the first example, the library doesn't extract any text (but it doesn't show any error, only "Using fallback font" warnings) and the second example shows this error:
11-08 00:34:30.742: E/ExternalFonts(21623): No TTF fallback font for 'Times-Roman'
It's definitely an issue with PDFTextStripperByArea because using PDFTextStripper extracts the text just fine.
Hi! I'm getting "No TTF fallback font" warnings:
01-27 10:06:50.207 31485-1570/? E/ExternalFonts﹕ No TTF fallback font for 'Times-Roman'
And I get this error with 1.8.8:
Could not find referenced cmap stream Identity-H
This is the code I made:
File pdfFile = new File(mBookPath);
PDFTextStripper textStripper = new PDFTextStripper();
PDDocument pdDoc = PDDocument.load(pdfFile);
textStripper.setStartPage(1);
textStripper.setEndPage(pdDoc.getNumberOfPages());
String data = textStripper.getText(pdDoc);
So it seems that PDFTextStripper fails too.
Have any of the fixes proposed by @RainHeart257 been merged into the current version of PdfBox-Android?
This is the book that fails: https://www.dropbox.com/s/urlhav4ze66kzmh/coffeescript%20copia.pdf?dl=0
Other books that fail with 'Could not find referenced cmap stream Identity-H':
- https://www.dropbox.com/s/8k80rri2jj7stii/focus-en-espac3b1ol29.pdf?dl=0
- https://www.dropbox.com/s/4vm886itf976c7x/Referencia%20de%20plugins.pdf?dl=0
And this one never finishes extracting, but I don't get any errors:
- https://www.dropbox.com/s/vsk2oaqjcv7l28j/Remote_Control_Devices_es.pdf?dl=0
TTF fallback warnings should be fixed for all the fonts except Times-Roman; I'm not sure why that one isn't. As far as I've been able to tell, it hasn't had any effect on functionality so far.
Identity-H was caused by something I messed up. ~~It'll be fixed in the next update.~~ Fixed
The TextStripper classes have issues that need to be worked on, and I'm not sure how long it will take to get them fixed. If you can, try some of the older jars. They may have TextStrippers that work better for your PDFs.