Can't get the tables from PDF using "extract_tables"
I have the PDF below, which seems to have "clean" tables, but extract_tables() gives me an empty list. http://databank.worldbank.org/data/download/GDP.pdf
library(tabulizer) # tabulizer_0.1.24
# read from local PDF file
# myPDF <- extract_tables("GDP.pdf")
# read from link
myPDF <- extract_tables("http://databank.worldbank.org/data/download/GDP.pdf")
length(myPDF)
# [1] 0
I tried extract_areas, which works fine.
Any pointers as to why extract_tables wouldn't work? Maybe I am missing some arguments?
> sessionInfo()
R version 3.4.1 Patched (2017-07-04 r72891)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tabulizer_0.1.24
loaded via a namespace (and not attached):
[1] tabulizerjars_0.9.2 compiler_3.4.1 tools_3.4.1 rJava_0.9-8
[5] png_0.1-7
I can reproduce. I wonder if extract_tables gets confused by the header lines. It would be nice if this worked automatically, since the PDF indeed is pretty clean. My guess is that this is an upstream issue (https://github.com/tabulapdf/tabula-java/) but I'd be happy if I were wrong.
I just wanted to note that you could set the area argument of extract_tables. I know that's not ideal, but better than doing it interactively for all of the pages.
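For what it's worth, here is a rough sketch of what I mean. The page range and the coordinates are placeholders I made up, not values measured from GDP.pdf, and make_areas is just a hypothetical helper name. As far as I understand the docs, area takes a list with one c(top, left, bottom, right) vector (in points) per page, and guess = FALSE tells extract_tables to use those areas instead of detecting tables itself.

```r
# Hypothetical helper: reuse one bounding box for every requested page.
# Each box is c(top, left, bottom, right), measured in points.
make_areas <- function(pages, box) {
  stopifnot(is.numeric(box), length(box) == 4)
  rep(list(box), length(pages))
}

pages <- 1:4                  # assumed page range, adjust to the document
box   <- c(90, 40, 760, 570)  # placeholder coordinates, not measured from the PDF
areas <- make_areas(pages, box)

# Only attempt the extraction if tabulizer is installed and a local copy exists.
if (requireNamespace("tabulizer", quietly = TRUE) && file.exists("GDP.pdf")) {
  tables <- tabulizer::extract_tables(
    "GDP.pdf",
    pages = pages,
    area  = areas,
    guess = FALSE  # use the supplied areas rather than automatic detection
  )
  print(length(tables))
}
```

You could probably get real coordinates once by running locate_areas() on a single page and then reuse them for the rest, assuming the table sits in the same place on every page.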
There's an update of Tabula that was released last week, which apparently includes a number of fixes and improvements. It is going to require a bit of work to integrate, but I will revisit this once I have the new version working to see if that solves it.
@leeper @zx8754
Something similar is happening to me. I can read the tables in the PDF file, but only the header values are read and not the table contents.
Any suggestions on how to solve this?
The PDF in question: https://www.qc.cuny.edu/About/Research/Documents/Fact_Book_2014-2015_Final.pdf
My problem is that extract_tables does not extract the table on page 86, while it works normally on the other pages.
Any solution?