PdfBox-Android Very slow extracting text

I think I'm missing something, because I can't believe it should take tens of seconds (or even minutes) to extract text. Can you please help me? This is my code (the entry point is simpleReadPdf):

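// Signature missing from the post; reconstructed here from the call
// FileReaderUtils.getPdfDoc(file, context) in simpleReadPdf.
public static PDDocument getPdfDoc(File file, Context context) {
    try {
        return PDDocument.load(file);
    } catch (IOException e) {
        // Probably an encrypted file
        e.printStackTrace();
        return null;
    }
}
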
public static String simpleReadPdf(File file, Context context) {
    String text = null;
    PDFBoxResourceLoader.init(context);
    PDDocument document = FileReaderUtils.getPdfDoc(file, context);

    try {
        // getPdfDoc returns null when the file could not be loaded
        if (document != null) text = extractTextFromPDF(document);
    } catch (IOException ioe) {
        // Probably an encrypted file
        ioe.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (document != null) document.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    // Return an empty string instead of hitting a NullPointerException when extraction failed
    return text != null ? text : "";
}

// Extracts the text of pages 1 to 3. The caller owns the document and closes it.
public static String extractTextFromPDF(PDDocument doc) throws IOException {
    PDFTextStripper textStripper = new PDFTextStripper();
    textStripper.setStartPage(1);
    textStripper.setEndPage(3);
    return textStripper.getText(doc);
}

Dec 26 '17 15:12 adepase

What are the specs of the device you are testing on? Stripping text is slow, but with the number of pages you're stripping, it shouldn't take minutes.

Jan 03 '18 06:01 TomRoush

I'm testing on a Samsung S7 edge (so, pretty good hardware) and trying to strip the attached PDF. It takes 5 seconds just for pages 0 to 3. (BTW: according to the docs, the page range should start at 1 and be inclusive, but if I start from 1 I miss the front page.) Attachment: confessioni[1].pdf

Thank you

Jan 03 '18 09:01 adepase

5 seconds for those pages is about what I would expect and similar to my time. As I mentioned before, text stripping is slow. The start page should be 1-indexed, as you said; I'll look into why it's 0-indexed.

Jan 31 '18 19:01 TomRoush

Extracting text from the sample is fast, whereas when I use it in my own app it slows down massively. Any idea how to fix it?

Thank you for this library.

Mar 11 '18 18:03 arsh-7

@pro-preet Is it slower using the same PDF stripping in your code than it is stripping from the sample?
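
One way to answer that with numbers is to time the exact same call in both apps. Below is a minimal sketch using only the Java standard library; `StripTimer` and `Timed` are hypothetical names made up for illustration and are not part of PdfBox-Android.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of PdfBox-Android) for timing a single call. */
final class StripTimer {

    /** Result of a timed call: the value it returned and the elapsed wall-clock time. */
    static final class Timed<T> {
        final T value;
        final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    private StripTimer() {}

    /** Runs the task once and measures its duration with the monotonic clock. */
    static <T> Timed<T> time(Callable<T> task) throws Exception {
        long start = System.nanoTime();
        T value = task.call();
        long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return new Timed<>(value, elapsed);
    }
}
```

Wrapping the stripping call, for example `StripTimer.time(() -> textStripper.getText(doc))`, in both the sample and your own app and logging `elapsedMillis` should show whether the stripping itself is slower or whether the time goes somewhere else (loading the file, initialisation, and so on).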

Apr 15 '18 00:04 TomRoush

I have a similar issue: text extraction is very slow, but only if the phone (S7) is connected to a PC (tested with a small PDF, only 20 words).

Oct 01 '18 11:10 mobilecityCZ

+1

Jan 07 '19 15:01 jenmo917

I was looking to easily extract text, but a single-page PDF with 35 lines of actual content takes 20 s or so on a fairly recent device (Nokia 8.1, Android 10). I did not expect that.

I was expecting that the text is already present in the PDF format, so it's just a simple extraction? Apparently not?

Update: if you are looking to extract text in under 1 s, I just found https://github.com/benjinus/android-support-pdfium, which works very fast.

Dec 12 '19 14:12 peterdk

Try using a thread to get the data before it's needed. You have to design an algorithm for when you want/need the data. Hope it helps!
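
A minimal sketch of that idea, assuming the extraction can be wrapped in a `Callable<String>` (for instance around `simpleReadPdf(file, context)` from the first post). `TextPrefetcher` is a hypothetical class name, and only standard `java.util.concurrent` calls are used:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: starts a slow extraction early on a worker thread. */
final class TextPrefetcher implements AutoCloseable {
    // A single worker thread keeps extractions sequential and off the calling thread.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /** Submits the extraction to the worker thread and returns a handle to the result. */
    Future<String> prefetch(Callable<String> extraction) {
        return worker.submit(extraction);
    }

    /** Stops accepting new work; tasks already submitted still run to completion. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

Call `prefetch(...)` as soon as you know which file will be needed and read the result later with `Future.get()`. Note that `get()` blocks until the text is ready, so on Android it should not be called on the main thread. This does not make the stripping itself faster; it only hides the wait.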

Jul 22 '20 04:07 pranayzv
