unable to extract tables from PDF

Open himsoni-cloud opened this issue 5 years ago • 2 comments

Hi

I am getting a subprocess error while using the tabula-py library to extract tables from a PDF. I checked with the tabula-py maintainers, and they told me "this is not tabula-py's issue but tabula-java's one."

(.venv) ➜  tabula-py git:(master) ✗ java -jar tabula/tabula-1.0.3-jar-with-dependencies.jar Testing.pdf
Error: Error expected floating point number actual='-17.-21823'

Could you please take a look?

The PDF file is attached for your reference: Testing.pdf

himsoni-cloud avatar Jan 16 '20 13:01 himsoni-cloud

I investigated further: the error comes from PDFBox, which tabula-java depends on, so it would be better to raise an issue with PDFBox.

java -jar pdfbox-app-2.0.18.jar ExtractText Testing.pdf
Exception in thread "main" java.io.IOException: Error expected floating point number actual='-17.-21823'
	at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:78)
	at org.apache.pdfbox.cos.COSNumber.get(COSNumber.java:115)
	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:952)
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:154)
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryNameValuePair(BaseParser.java:283)
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:216)
	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:867)
	at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:917)
	at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:886)
	at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:806)
	at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:766)
	at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:187)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:226)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1099)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1082)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1023)
	at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:218)
	at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:97)
	at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
Caused by: java.lang.NumberFormatException
	at java.math.BigDecimal.<init>(BigDecimal.java:494)
	at java.math.BigDecimal.<init>(BigDecimal.java:383)
	at java.math.BigDecimal.<init>(BigDecimal.java:806)
	at org.apache.pdfbox.cos.COSFloat.<init>(COSFloat.java:59)
	... 18 more

chezou avatar Jan 18 '20 09:01 chezou

Your PDF contains this font descriptor object:

17 0 obj
<</Ascent 891 /CapHeight 662 /Descent -216 /Flags 32 /FontBBox
  [-497 -306 1120 1023] /FontFile2 18 0 R /FontName
  /AFPTimesNewRoman-Italic /ItalicAngle -17.-21823 /StemV 80 /Type
  /FontDescriptor /XHeight 441>>
endobj

According to the PDF specification, the ItalicAngle must be a number, and -17.-21823 is not a valid number representation. PDF parsers that do not repair such errors under the hood will therefore most likely fail to read your file. PDFBox does fail.
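
To check other files for the same problem before handing them to tabula, something like the following Python sketch may help. It is only a rough illustration, not part of tabula-java or PDFBox: the names `is_pdf_number` and `malformed_italic_angles` are made up here, the regular expression is my reading of the PDF number syntax (optional sign, digits, at most one decimal point, no exponent), and it only sees objects stored uncompressed in the file, as object 17 is in yours.

```python
import re

# PDF numeric syntax as assumed here: optional leading sign, then digits
# with at most one decimal point. No exponent, no sign in the middle.
PDF_NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")

# The token following /ItalicAngle, read up to whitespace or a PDF delimiter.
ITALIC_ANGLE = re.compile(rb"/ItalicAngle\s*([^\s/\[\]<>(){}%]+)")


def is_pdf_number(token: bytes) -> bool:
    """Return True if the whole token matches the assumed number syntax."""
    return PDF_NUMBER.fullmatch(token) is not None


def malformed_italic_angles(raw: bytes) -> list:
    """Collect every /ItalicAngle value in raw PDF bytes that is not a number."""
    return [
        m.group(1)
        for m in ITALIC_ANGLE.finditer(raw)
        if not is_pdf_number(m.group(1))
    ]


# Shortened copy of the font descriptor from the file in this issue.
sample = b"<</Ascent 891 /ItalicAngle -17.-21823 /StemV 80 /Type /FontDescriptor>>"
print(malformed_italic_angles(sample))  # [b'-17.-21823']
```

For a real file you would pass `open("Testing.pdf", "rb").read()` instead of `sample`. An empty result does not prove the file is valid, since descriptors inside compressed object streams are not visible to this scan.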

mkl-public avatar Jan 20 '20 09:01 mkl-public