PdfBox-Android
PDFTextStripper.getText() throws IndexOutOfBoundsException for some PDFs
Describe the bug: For some PDFs, PDFTextStripper.getText() throws an IndexOutOfBoundsException.
To reproduce: The code snippet below reproduces the behavior.
// In onCreate of MainApplication.kt
PDFBoxResourceLoader.init(this)
// In a fragment
val inputStream = context.contentResolver.openInputStream(uri)
val pdDoc = PDDocument.load(inputStream)
val pdfStripper = PDFTextStripper()
val text = pdfStripper.getText(pdDoc)
PDF example: HARCOURT Invisible Umpires.pdf
Expected behavior: The text should be extracted correctly from the PDF.
Actual behavior: Running this code throws the following exception:
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.get(ArrayList.java:437)
at com.tom_roush.fontbox.cmap.CMapParser.parseBeginbfrange(CMapParser.java:373)
at com.tom_roush.fontbox.cmap.CMapParser.parse(CMapParser.java:137)
at com.tom_roush.pdfbox.pdmodel.font.CMapManager.parseCMap(CmapManager.java:73)
at com.tom_roush.pdfbox.pdmodel.font.PDFont.readCMap(PDFont.java:175)
at com.tom_roush.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:121)
at com.tom_roush.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:86)
at com.tom_roush.pdfbox.pdmodel.font.PDType1CFont.<init>(PDType1CFont.java:74)
at com.tom_roush.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:58)
at com.tom_roush.pdfbox.pdmodel.PDResources.getFont(PDResources.java:122)
at com.tom_roush.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:58)
at com.tom_roush.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:816)
at com.tom_roush.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:473)
at com.tom_roush.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:447)
at com.tom_roush.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:148)
at com.tom_roush.pdfbox.text.PDFTextStreamEngine.processPage(PDFTextStreamEngine.java:141)
at com.tom_roush.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:394)
at com.tom_roush.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:321)
at com.tom_roush.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:268)
at com.tom_roush.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:229)
at com.cliffweitzman.speechify2.repository.LibraryRepository.pendingRecordToText(LibraryRepository.kt:218)
at com.cliffweitzman.speechify2.repository.LibraryRepository$pendingRecordToText$1.invokeSuspend(Unknown Source:15)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
Environment details:
- PdfBox-Android version: 2.0.2.0
- Android API version: API 30
Please retry with the current version; that bug was fixed a month ago. https://github.com/TomRoush/PdfBox-Android/blame/master/library/src/main/java/com/tom_roush/fontbox/cmap/CMapParser.java#L365