
Error String index out of range: -1 in PDFLayoutTextStripper

Open Jaumexr opened this issue 5 years ago • 2 comments

Hi, I have this code, with the attached PDF to test.

    public void doStrip() {
        String string = null;
        try {
            PDFParser pdfParser = new PDFParser(new RandomAccessFile(new File("D:/escaner/errorsPDFBOX/AN20-0149-0602201842.pdf"), "r"));
            pdfParser.parse();
            PDDocument pdDocument = new PDDocument(pdfParser.getDocument());
            PDFTextStripper pdfTextStripper = new PDFLayoutTextStripper();
            string = pdfTextStripper.getText(pdDocument);
            BufferedWriter writer = Files.newBufferedWriter(FileSystems.getDefault().getPath("D:/escaner", "fichero.txt"), Charset.forName("UTF-8"));
            writer.write(string);
            writer.flush();
            writer.close();
        } catch (InvalidPasswordException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

AN20-0149-0602201842.pdf

I get this exception:

    Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.charAt(String.java:658)
        at com.sagedillepasa.gestion.TextLine.isSpaceCharacterAtIndex(PDFLayoutTextStripper.java:269)
        at com.sagedillepasa.gestion.TextLine.getNextValidIndex(PDFLayoutTextStripper.java:283)
        at com.sagedillepasa.gestion.TextLine.computeIndexForCharacter(PDFLayoutTextStripper.java:263)
        at com.sagedillepasa.gestion.TextLine.writeCharacterAtIndex(PDFLayoutTextStripper.java:229)
        at com.sagedillepasa.gestion.PDFLayoutTextStripper.writeLine(PDFLayoutTextStripper.java:127)
        at com.sagedillepasa.gestion.PDFLayoutTextStripper.writeTextPositionList(PDFLayoutTextStripper.java:157)
        at com.sagedillepasa.gestion.PDFLayoutTextStripper.iterateThroughTextList(PDFLayoutTextStripper.java:152)
        at com.sagedillepasa.gestion.PDFLayoutTextStripper.writePage(PDFLayoutTextStripper.java:96)
        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:392)
        at com.sagedillepasa.gestion.PDFLayoutTextStripper.processPage(PDFLayoutTextStripper.java:80)
        at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
        at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
        at com.sagedillepasa.gestion.test.doStrip(test.java:44)
        at com.sagedillepasa.gestion.test.main(test.java:61)

Jaumexr, Feb 21 '20 07:02

I have the exact same issue with the example code - it doesn't work.

jenka13all, Nov 05 '20 13:11

I'm encountering the same issue. The exception seems to happen because index is 0 in TextLine.getNextValidIndex (the caller shown in the stack trace above), so isSpaceCharacterAtIndex is called with -1. Changing the condition to !isCharacterPartOfPreviousWord && index > 0 && this.isSpaceCharacterAtIndex(index - 1) seems to fix the issue.
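
For anyone who wants to see the guard in isolation, here is a minimal sketch. It is not the actual PDFLayoutTextStripper source: the class GuardedIndexSketch, the static helpers, and the simplified isSpaceCharacterAtIndex body are all assumptions made for illustration. It only shows why putting index > 0 ahead of the charAt-based check avoids the exception, since && short-circuits before index - 1 is ever evaluated.

```java
// Hypothetical stand-alone sketch; names other than isSpaceCharacterAtIndex
// and isCharacterPartOfPreviousWord are not taken from the library.
class GuardedIndexSketch {

    // Simplified stand-in for the check in the stack trace:
    // String.charAt throws StringIndexOutOfBoundsException for index -1.
    static boolean isSpaceCharacterAtIndex(String line, int index) {
        return line.charAt(index) == ' ';
    }

    // Unguarded condition: looks at index - 1 even when index is 0.
    static boolean unguarded(String line, int index, boolean isCharacterPartOfPreviousWord) {
        return !isCharacterPartOfPreviousWord && isSpaceCharacterAtIndex(line, index - 1);
    }

    // Guarded condition: index > 0 is tested first, so the charAt call
    // is skipped entirely when there is no previous character.
    static boolean guarded(String line, int index, boolean isCharacterPartOfPreviousWord) {
        return !isCharacterPartOfPreviousWord && index > 0 && isSpaceCharacterAtIndex(line, index - 1);
    }

    public static void main(String[] args) {
        String line = "a b";
        System.out.println(guarded(line, 0, false)); // no previous character, no exception
        System.out.println(guarded(line, 2, false)); // previous character is a space
        try {
            unguarded(line, 0, false);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded call failed: " + e.getMessage());
        }
    }
}
```

Whether skipping the increment at index 0 gives the right layout for every PDF is something I have not verified; the sketch only covers the crash itself.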

Athou, Jan 20 '22 08:01