PdfBox-Android

Very slow extracting text

Open adepase opened this issue 7 years ago • 9 comments

I think I'm missing something, because I can't believe it should take tens of seconds (or even minutes) to extract text. Can you please help me? This is my code (I start by calling simpleReadPdf):

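// Called below as FileReaderUtils.getPdfDoc(file, context)
public static PDDocument getPdfDoc(File file, Context context) {
    try {
        return PDDocument.load(file);
    } catch (IOException e) {
        // Probably an encrypted document
        e.printStackTrace();
        return null;
    }
}
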
public static String simpleReadPdf(File file, Context context) {
    String text = null;
    PDFBoxResourceLoader.init(context);
    PDDocument document = FileReaderUtils.getPdfDoc(file, context);

    try {
        text = extractTextFromPDF(document);
    } catch (IOException ioe) {
        // Probably an encrypted document
        ioe.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (document != null) document.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    // text stays null if loading or extraction failed
    return text;
}

public static String extractTextFromPDF(PDDocument doc) throws IOException {
    String dataS = null;
    try {
        PDFTextStripper textStripper = new PDFTextStripper();
        textStripper.setStartPage(1);
        textStripper.setEndPage(3);
        dataS = textStripper.getText(doc);
    } finally {
        if (doc != null) doc.close();
    }
    return dataS;
}

adepase, Dec 26 '17 15:12

What are the specs of the device you are testing on? Stripping text is slow, but with the number of pages you're stripping, it shouldn't take minutes.

TomRoush, Jan 03 '18 06:01

I'm testing on a Samsung S7 edge (so, pretty good hardware) and trying to strip the attached PDF. It takes 5 seconds just for pages 0 to 3. (BTW: according to the docs the start page should be 1-based and inclusive, but if I start from 1 I miss the front page.)

confessioni[1].pdf

Thank you

adepase, Jan 03 '18 09:01

5 seconds for those pages is about what I would expect and is similar to my time. As I mentioned before, text stripping is slow. The start page should be 1-indexed, as you said; I'll look into why it's 0-indexed.

TomRoush, Jan 31 '18 19:01

Extracting text in the sample app is fast, whereas when I use it in my own app it slows down massively. Any idea how to fix it?

Thank you for this library.

arsh-7, Mar 11 '18 18:03

@pro-preet So the same PDF stripping code is slower in your app than it is in the sample?

TomRoush, Apr 15 '18 00:04
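
One way to compare the sample with your own app is to time the load step and the strip step separately in both. The following is only a rough sketch, not part of PdfBox-Android: `StageTimer`, `Timed` and `time` are hypothetical names, and the two lambdas in `main` are placeholders where the real `PDDocument.load(file)` and `textStripper.getText(doc)` calls would go.

```java
import java.util.concurrent.Callable;

public class StageTimer {
    /** Result of one timed stage: the value it produced and the elapsed wall-clock time. */
    public static final class Timed<T> {
        public final T value;
        public final long millis;

        Timed(T value, long millis) {
            this.value = value;
            this.millis = millis;
        }
    }

    /** Runs the stage once and measures it with the monotonic System.nanoTime() clock. */
    public static <T> Timed<T> time(Callable<T> stage) {
        long start = System.nanoTime();
        try {
            T value = stage.call();
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
            return new Timed<>(value, elapsedMillis);
        } catch (Exception e) {
            // Checked exceptions (such as IOException from a real load) are rethrown unchecked.
            throw new RuntimeException("timed stage failed", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder stages (assumption): replace with the real load and strip calls.
        Timed<String> load = time(() -> "document");
        Timed<Integer> strip = time(() -> load.value.length());
        System.out.println("load: " + load.millis + " ms, strip: " + strip.millis + " ms");
    }
}
```

If the load time is similar in both apps but the strip time is not (or the other way round), that narrows down where to look.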

I have a similar issue: text extraction is very slow, but only when the phone (S7) is connected to a PC (tested with a small PDF, 20 words only).

mobilecityCZ, Oct 01 '18 11:10

+1

jenmo917, Jan 07 '19 15:01

I was looking to easily extract text, but a single-page PDF with 35 lines of actual content takes 20 s or so on a fairly recent device (Nokia 8.1, Android 10). I did not expect that.

I was expecting that the text is already present in the PDF format, so it's just a simple extraction? Apparently not?

Update: If you are looking to extract text in under 1 s, I just found https://github.com/benjinus/android-support-pdfium which works very fast.

peterdk, Dec 12 '19 14:12

Try using a background thread to fetch the data before it's needed. You'll have to design the logic for when you want/need the data. Hope it helps!

pranayzv, Jul 22 '20 04:07
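
The background-thread suggestion above could look roughly like this. It is a minimal sketch under stated assumptions, not something from the library: `TextPrefetcher`, `prefetch` and `await` are hypothetical names, and the lambda passed in `main` is a placeholder for the real extraction call (for example `simpleReadPdf(file, context)`).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TextPrefetcher {
    // One worker thread, so extractions run one at a time in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private volatile Future<String> pending;

    /** Starts the extraction on the background thread and returns immediately. */
    public void prefetch(Callable<String> extraction) {
        pending = worker.submit(extraction);
    }

    /** Blocks only if the background work is not finished yet; returns null on failure. */
    public String await() {
        Future<String> current = pending;
        if (current == null) {
            return null; // nothing was prefetched
        }
        try {
            return current.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            e.getCause().printStackTrace();
            return null;
        }
    }

    /** Stops the worker thread once no more extractions are needed. */
    public void shutdown() {
        worker.shutdown();
    }

    public static void main(String[] args) {
        TextPrefetcher prefetcher = new TextPrefetcher();
        // Placeholder (assumption): the real call would run the PDF text extraction here.
        prefetcher.prefetch(() -> "extracted text");
        System.out.println(prefetcher.await());
        prefetcher.shutdown();
    }
}
```

This does not make the stripping itself faster; it only moves the wait off the critical path. On Android, `await()` should not be called on the main thread unless the result is likely to be ready already.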