PDFTextStripperByArea failed extracting text (font problems?)

Open roberto-naharro opened this issue 10 years ago • 11 comments

Hello,

I have tried the library with a lot of documents. It works fine, but with some of them I always get a similar message:

11-06 14:59:34.737: E/PDResources(23440):   at java.lang.Thread.run(Thread.java:841)
11-06 14:59:34.737: W/PDTrueTypeFont(23440): Using fallback font for ArialMT
11-06 14:59:34.737: E/ExternalFonts(23440): No TTF fallback font for 'Helvetica'
11-06 14:59:34.737: W/PDTrueTypeFont(23440): Using fallback font for Tahoma
11-06 14:59:34.747: E/ExternalFonts(23440): No TTF fallback font for 'Helvetica'
11-06 14:59:34.747: W/PDTrueTypeFont(23440): Using fallback font for ArialNarrow
11-06 14:59:34.747: E/ExternalFonts(23440): No TTF fallback font for 'Helvetica'
11-06 14:59:34.747: W/PDTrueTypeFont(23440): Using fallback font for ArialNarrow-BoldItalic
11-06 14:59:34.747: E/ExternalFonts(23440): No TTF fallback font for 'Helvetica-BoldOblique'
11-06 14:59:34.747: W/System.err(23440): java.io.IOException: Error: Could not find referenced cmap stream Identity-H
11-06 14:59:34.747: W/System.err(23440):    at org.apache.fontbox.cmap.CMapParser.getExternalCMap(CMapParser.java:383)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.fontbox.cmap.CMapParser.parsePredefined(CMapParser.java:84)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.pdfbox.pdmodel.font.CMapManager.getPredefinedCMap(CmapManager.java:34)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.pdfbox.pdmodel.font.PDType0Font.readEncoding(PDType0Font.java:71)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:48)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:73)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:172)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.pdfbox.contentstream.PDFStreamEngine.getFonts(PDFStreamEngine.java:503)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:32)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:466)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:220)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:185)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:139)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.pdfbox.contentstream.PDFTextStreamEngine.processStream(PDFTextStreamEngine.java:105)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:353)
11-06 14:59:34.747: W/System.err(23440):    at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:102)

Sometimes it is this message, and sometimes it is "No fallback font for 'Helvetica'" or the same message for another font.

Is this a problem with this library, or maybe with PDFBox?

The code I have used is this:

PDFTextStripperByArea stripper = new PDFTextStripperByArea();

[...]
stripper.addRegion(k, regions_hash.get(k));
[...]

regions=stripper.getRegions();  
for(String region: regions) {
    String textForRegion = stripper.getTextForRegion(region);
    textForRegion=textForRegion.trim();

    if (!textForRegion.isEmpty()) {
        outStream.write((textForRegion+' ').getBytes());
    }
}

roberto-naharro avatar Nov 06 '14 14:11 roberto-naharro

Already working on a fix. It seems to be a bug in the font system rather than something specific to text extraction or any other action.

TomRoush avatar Nov 07 '14 00:11 TomRoush

Yesterday I was desperate, so I downloaded the code to see if I could do something to make it work. I changed a couple of lines and the "Could not find referenced cmap stream Identity-H" error doesn't appear anymore.

I have changed this in CMapParser.java:

/**
 * Returns an input stream containing the given "use" CMap.
 */
protected InputStream getExternalCMap(String name) throws IOException
{
    String path;

    URL url = getClass().getResource(name);
    //XXX Code changed
    if (url == null) {
        path="/org/apache/pdfbox/resources/cmap/";
        url = getClass().getResource(path+name);
    }
    if (url == null)
    {
        throw new IOException("Error: Could not find referenced cmap stream " + name);
    }
    return url.openStream();
}

It seems that the library doesn't find the resources folder in the project.
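
A possible explanation for why the extra lookup helps: `Class.getResource` resolves a bare name such as `Identity-H` relative to the package of the calling class, while a name that starts with `/` is resolved from the classpath root. The sketch below only mimics that documented naming rule; `ResourceNameSketch` and `resolve` are hypothetical names, not part of PDFBox, and the package name is taken from the stack trace above.

```java
/** Hypothetical helper that mimics the name-resolution rule of Class.getResource. */
final class ResourceNameSketch {

    /**
     * Returns the absolute resource name that would be looked up for a class
     * living in packageName when getResource(name) is called on it.
     */
    static String resolve(String packageName, String name) {
        if (name.startsWith("/")) {
            // Absolute name: everything after the leading slash.
            return name.substring(1);
        }
        if (packageName.isEmpty()) {
            // Default package: the name is used as-is.
            return name;
        }
        // Relative name: prefixed with the package, dots replaced by slashes.
        return packageName.replace('.', '/') + "/" + name;
    }

    public static void main(String[] args) {
        String pkg = "org.apache.fontbox.cmap"; // package of CMapParser in the stack trace
        System.out.println(resolve(pkg, "Identity-H"));
        System.out.println(resolve(pkg, "/org/apache/pdfbox/resources/cmap/Identity-H"));
    }
}
```

Under this rule the bare name is searched next to `CMapParser`, so if the cmap files are packaged under `/org/apache/pdfbox/resources/cmap/`, only the absolute path would find them.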

I'm investigating the fallback font problem too. It's related to the system fonts that Android uses by default (some of the fonts defined in the code don't exist on Android).

roberto-naharro avatar Nov 07 '14 11:11 roberto-naharro

Does your code work without fixing the fonts issue? It's popped up with other code before, but didn't impact the functionality at all.

TomRoush avatar Nov 07 '14 17:11 TomRoush

I think it works; I don't get the "Could not find referenced cmap stream Identity-H" error now. My app recognizes most of the documents with this fix.

roberto-naharro avatar Nov 07 '14 17:11 roberto-naharro

Are there still documents that don't work? Would you mind sharing an example pdf if you can?

TomRoush avatar Nov 07 '14 22:11 TomRoush

I also have problems with this document: https://drive.google.com/file/d/0B-t3Zj2dsa4AZng3c2t5d05CV0E/view?usp=sharing

The error is:

11-07 23:38:21.405: W/System.err(18854): java.lang.IllegalStateException: No fonts available on the system for Helvetica
11-07 23:38:21.405: W/System.err(18854):    at org.apache.pdfbox.pdmodel.font.ExternalFonts.getType1FallbackFont(ExternalFonts.java:256)
11-07 23:38:21.405: W/System.err(18854):    at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:190)
11-07 23:38:21.405: W/System.err(18854):    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:49)
11-07 23:38:21.405: W/System.err(18854):    at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:172)
11-07 23:38:21.405: W/System.err(18854):    at org.apache.pdfbox.contentstream.PDFStreamEngine.getFonts(PDFStreamEngine.java:503)
11-07 23:38:21.405: W/System.err(18854):    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:32)
11-07 23:38:21.405: W/System.err(18854):    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:466)
11-07 23:38:21.405: W/System.err(18854):    at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:220)
11-07 23:38:21.405: W/System.err(18854):    at org.apache.pdfbox.contentstream.PDFStreamEngine.processSubStream(PDFStreamEngine.java:185)
11-07 23:38:21.405: W/System.err(18854):    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:139)
11-07 23:38:21.405: W/System.err(18854):    at org.apache.pdfbox.contentstream.PDFTextStreamEngine.processStream(PDFTextStreamEngine.java:105)
11-07 23:38:21.405: W/System.err(18854):    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:373)
11-07 23:38:21.405: W/System.err(18854):    at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:102)

roberto-naharro avatar Nov 07 '14 22:11 roberto-naharro

I have made a lot of changes and fixed the problems with the last file. Here are all the changes:

First, I noticed that every system has its own paths for finding fonts, so I have defined a new finder for Android:

org.apache.fontbox.util.autodetect FontFileFinder.java:57

if ("The Android Project".equals(System.getProperty("java.vendor")))
    return new AndroidFontDirFinder();
else
    return new UnixFontDirFinder();
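
One detail worth checking in a vendor test like this is how the strings are compared: in Java, `==` on strings compares object identity, so a value read at runtime may not match a literal even when the text is the same. A minimal sketch of a null-safe check follows; `VendorCheckSketch` and `isAndroidVendor` are hypothetical names, and the vendor string is the one used in the snippet above.

```java
/** Hypothetical helper for the vendor test; not part of fontbox. */
final class VendorCheckSketch {

    /** Vendor string taken from the snippet in this thread. */
    private static final String ANDROID_VENDOR = "The Android Project";

    /** Compares by content and tolerates a null (unset) property value. */
    static boolean isAndroidVendor(String vendor) {
        return ANDROID_VENDOR.equals(vendor);
    }

    public static void main(String[] args) {
        // Prints false on a desktop JVM whose java.vendor differs.
        System.out.println(isAndroidVendor(System.getProperty("java.vendor")));
    }
}
```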

New finder class (based on UnixFontDirFinder)

org.apache.fontbox.util.autodetect AndroidFontDirFinder.java

package org.apache.fontbox.util.autodetect;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class AndroidFontDirFinder extends NativeFontDirFinder
{

    /**
     * @return a list of possible font locations
     */
    protected String[] getSearchableDirectories()
    {
        return new String[] { 
                "/system/fonts" // system fonts
        };
    }

    /**
     * {@inheritDoc}
     */
    public Map<String, String> getCommonTTFMapping()
    {
        HashMap<String,String> map = new HashMap<String,String>();
        map.put("TimesNewRoman,BoldItalic","DroidSerif-BoldItalic");
        map.put("TimesNewRoman,Bold","DroidSerif-Bold");
        map.put("TimesNewRoman,Italic","DroidSerif-Italic");
        map.put("TimesNewRoman","DroidSerif-Regular");

        map.put("Arial,BoldItalic","Roboto-BoldItalic");
        map.put("Arial,Italic","Roboto-Italic");
        map.put("Arial,Bold","Roboto-Bold");
        map.put("Arial","Roboto-Regular");

        map.put("Courier,BoldItalic","DroidSansMono");
        map.put("Courier,Italic","DroidSansMono");
        map.put("Courier,Bold","DroidSansMono");
        map.put("Courier","DroidSansMono");

        map.put("Symbol", "OpenSymbol");
        map.put("ZapfDingbats", "Dingbats");
        return Collections.unmodifiableMap(map);
    }

}
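
To check which font names are actually present before adding them to a mapping, a directory scan along these lines could help. This is only a sketch under the assumption that fonts are plain `.ttf` files directly inside the directory; `FontDirScanSketch` and `scanTtf` are hypothetical names and this is not how `NativeFontDirFinder` is implemented.

```java
import java.io.File;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of PDFBox or fontbox. */
final class FontDirScanSketch {

    /** Maps font base names (file name without the ".ttf" suffix) to the files found directly in dir. */
    static Map<String, File> scanTtf(File dir) {
        Map<String, File> fonts = new TreeMap<String, File>();
        File[] files = dir.listFiles();
        if (files == null) {
            return fonts; // directory is missing or unreadable
        }
        for (File f : files) {
            String name = f.getName();
            if (f.isFile() && name.toLowerCase(Locale.ROOT).endsWith(".ttf")) {
                // Strip the four-character ".ttf" suffix, whatever its case.
                fonts.put(name.substring(0, name.length() - 4), f);
            }
        }
        return fonts;
    }

    public static void main(String[] args) {
        // "/system/fonts" is the directory named in the post; on a desktop JVM it usually does not exist.
        for (String fontName : scanTtf(new File("/system/fonts")).keySet()) {
            System.out.println(fontName);
        }
    }
}
```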

Last but not least, I have added the Android fonts to the substitutes, based on https://github.com/android/platform_frameworks_base/blob/master/data/fonts/system_fonts.xml:

org.apache.pdfbox.pdmodel.font ExternalFonts.java:103

/** Map of PostScript name substitutes, in priority order. */
private final static Map<String, List<String>> substitutes = new HashMap<String, List<String>>();
static
{
    //XXX Add Android Font substitutes
    // substitutes for standard 14 fonts
    substitutes.put("Courier",
            Arrays.asList("CourierNew", "CourierNewPSMT", "LiberationMono", "NimbusMonL-Regu","DroidSansMono"));
    substitutes.put("Courier-Bold",
            Arrays.asList("CourierNewPS-BoldMT", "CourierNew-Bold", "LiberationMono-Bold",
                    "NimbusMonL-Bold","DroidSansMono"));
    substitutes.put("Courier-Oblique",
            Arrays.asList("CourierNewPS-ItalicMT","CourierNew-Italic",
                    "LiberationMono-Italic", "NimbusMonL-ReguObli","DroidSansMono"));
    substitutes.put("Courier-BoldOblique",
            Arrays.asList("CourierNewPS-BoldItalicMT","CourierNew-BoldItalic",
                    "LiberationMono-BoldItalic", "NimbusMonL-BoldObli","DroidSansMono"));
    substitutes.put("Helvetica",
            Arrays.asList("ArialMT", "Arial", "LiberationSans", "NimbusSanL-Regu","Roboto-Regular"));
    substitutes.put("Helvetica-Bold",
            Arrays.asList("Arial-BoldMT", "Arial-Bold", "LiberationSans-Bold",
                    "NimbusSanL-Bold","Roboto-Bold"));
    substitutes.put("Helvetica-Oblique",
            Arrays.asList("Arial-ItalicMT", "Arial-Italic", "Helvetica-Italic",
                    "LiberationSans-Italic", "NimbusSanL-ReguItal", "Roboto-Italic"));
    substitutes.put("Helvetica-BoldOblique",
            Arrays.asList("Arial-BoldItalicMT", "Helvetica-BoldItalic",
                    "LiberationSans-BoldItalic", "NimbusSanL-BoldItal","Roboto-BoldItalic"));
    substitutes.put("Times-Roman",
            Arrays.asList("TimesNewRomanPSMT", "TimesNewRoman", "TimesNewRomanPS",
                    "LiberationSerif", "NimbusRomNo9L-Regu","DroidSerif-Regular"));
    substitutes.put("Times-Bold",
            Arrays.asList("TimesNewRomanPS-BoldMT", "TimesNewRomanPS-Bold",
                    "TimesNewRoman-Bold", "LiberationSerif-Bold",
                    "NimbusRomNo9L-Medi", "DroidSerif-Bold"));
    substitutes.put("Times-Italic",
            Arrays.asList("TimesNewRomanPS-ItalicMT", "TimesNewRomanPS-Italic",
                    "TimesNewRoman-Italic", "LiberationSerif-Italic",
                    "NimbusRomNo9L-ReguItal","DroidSerif-Italic"));
    substitutes.put("Times-BoldItalic",
            Arrays.asList("TimesNewRomanPS-BoldItalicMT", "TimesNewRomanPS-BoldItalic",
                    "TimesNewRoman-BoldItalic", "LiberationSerif-BoldItalic",
                    "NimbusRomNo9L-MediItal","DroidSerif-BoldItalic"));
    substitutes.put("Symbol", Arrays.asList("SymbolMT", "StandardSymL"));
    substitutes.put("ZapfDingbats", Arrays.asList("ZapfDingbatsITC", "Dingbats"));

    // extra substitute mechanism for CJK CIDFonts when all we know is the ROS
    substitutes.put("$Adobe-CNS1", Arrays.asList("AdobeMingStd-Light"));
    substitutes.put("$Adobe-Japan1", Arrays.asList("KozMinPr6N-Regular"));
    substitutes.put("$Adobe-Korea1", Arrays.asList("AdobeGothicStd-Bold"));
    substitutes.put("$Adobe-GB1", Arrays.asList("AdobeHeitiStd-Regular"));
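
The idea behind putting the Android names at the end of each list is that the substitutes are tried in priority order and the first one that is installed wins. A minimal sketch of that lookup, assuming the set of installed font names is already known; `SubstituteLookupSketch` and `firstAvailable` are hypothetical names and this is not the actual `ExternalFonts` code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of a priority-ordered substitute lookup. */
final class SubstituteLookupSketch {

    /**
     * Returns the requested font if it is installed, otherwise the first
     * installed substitute in list order, or null if none is available.
     */
    static String firstAvailable(Map<String, List<String>> substitutes,
                                 Set<String> installed, String requested) {
        if (installed.contains(requested)) {
            return requested;
        }
        List<String> candidates = substitutes.get(requested);
        if (candidates != null) {
            for (String candidate : candidates) {
                if (installed.contains(candidate)) {
                    return candidate;
                }
            }
        }
        return null; // caller decides how to report a missing fallback
    }

    public static void main(String[] args) {
        Map<String, List<String>> substitutes = new HashMap<String, List<String>>();
        substitutes.put("Helvetica",
                Arrays.asList("ArialMT", "Arial", "LiberationSans", "NimbusSanL-Regu", "Roboto-Regular"));
        // Assumed set of installed fonts, for illustration only.
        Set<String> installed = new HashSet<String>(Arrays.asList("Roboto-Regular", "DroidSansMono"));
        System.out.println(firstAvailable(substitutes, installed, "Helvetica"));
    }
}
```

With the assumed installed set, the lookup for Helvetica skips the four desktop names and lands on Roboto-Regular.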

There are still some errors with other files. Here are a couple of documents that show different kinds of errors: https://drive.google.com/file/d/0B-t3Zj2dsa4ATUw0NGhWWWtJRGc/view https://drive.google.com/file/d/0B-t3Zj2dsa4ATVRYTFdqNnZfbjQ/view?usp=sharing

In the first example, the library doesn't extract any text (but it doesn't show any error, only "Using fallback font" warnings) and the second example shows this error:

11-08 00:34:30.742: E/ExternalFonts(21623): No TTF fallback font for 'Times-Roman'

roberto-naharro avatar Nov 07 '14 23:11 roberto-naharro

It's definitely an issue with PDFTextStripperByArea because using PDFTextStripper extracts the text just fine.

TomRoush avatar Nov 09 '14 19:11 TomRoush

Hi! I'm getting "No TTF fallback font" warnings:

01-27 10:06:50.207 31485-1570/? E/ExternalFonts﹕ No TTF fallback font for 'Times-Roman'

And I get this error with 1.8.8:

Could not find referenced cmap stream Identity-H

This is the code I made:

File pdfFile = new File(mBookPath);

PDFTextStripper textStripper = new PDFTextStripper();
PDDocument pdDoc = PDDocument.load(pdfFile);

textStripper.setStartPage(1);
textStripper.setEndPage(pdDoc.getNumberOfPages());

String data = textStripper.getText(pdDoc);

So it seems that PDFTextStripper fails too.

Have any of the fixes proposed by @RainHeart257 been merged into the current version of PdfBox-Android?

This is the book that fails: https://www.dropbox.com/s/urlhav4ze66kzmh/coffeescript%20copia.pdf?dl=0

rubdottocom avatar Jan 27 '15 09:01 rubdottocom

Other books that fail with 'Could not find referenced cmap stream Identity-H':

  • https://www.dropbox.com/s/8k80rri2jj7stii/focus-en-espac3b1ol29.pdf?dl=0
  • https://www.dropbox.com/s/4vm886itf976c7x/Referencia%20de%20plugins.pdf?dl=0

And this one never finishes extracting, but I don't get any errors:

  • https://www.dropbox.com/s/vsk2oaqjcv7l28j/Remote_Control_Devices_es.pdf?dl=0

rubdottocom avatar Jan 27 '15 09:01 rubdottocom

TTF fallback warnings should be fixed for all the fonts except Times-Roman; I'm not sure why that one isn't. As far as I've been able to tell, it hasn't had any effect on functionality so far.

Identity-H was from something I messed up. ~~It'll be fixed in the next update.~~ Fixed.

The TextStripper classes have issues that need to be worked on, and I'm not sure how long it will take to get it fixed. If you can, try some of the older jars. They may have TextStrippers that will work better for your pdfs.

TomRoush avatar Jan 30 '15 05:01 TomRoush