tabulapdf Can't get the tables from PDF using "extract

I have the PDF below, which seems to have "clean" tables, but extract_tables() gives me an empty list: http://databank.worldbank.org/data/download/GDP.pdf

library(tabulizer) # tabulizer_0.1.24

# read from local PDF file
# myPDF <- extract_tables("GDP.pdf")

# read from link
myPDF <- extract_tables("http://databank.worldbank.org/data/download/GDP.pdf")

length(myPDF)
# [1] 0

I tried to use extract_areas, which works fine.

Any pointers as to why extract_tables wouldn't work? Maybe I am missing some arguments?

> sessionInfo()
R version 3.4.1 Patched (2017-07-04 r72891)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tabulizer_0.1.24

loaded via a namespace (and not attached):
[1] tabulizerjars_0.9.2 compiler_3.4.1      tools_3.4.1         rJava_0.9-8        
[5] png_0.1-7

Jul 21 '17 10:07 zx8754

I can reproduce. I wonder if extract_tables gets confused by the header lines. It would be nice if this worked automatically, since the PDF indeed is pretty clean. My guess is that this is an upstream issue (https://github.com/tabulapdf/tabula-java/) but I'd be happy if I were wrong.

I just wanted to note that you could set the area argument of extract_tables. I know that's not ideal, but better than doing it interactively for all of the pages.
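As a rough sketch of what setting the area argument could look like: the coordinates, the page range, and the helper name areas_for_pages below are all placeholders I made up for illustration, not values measured from GDP.pdf. Real coordinates could be read off interactively with locate_areas(). Each area is given as c(top, left, bottom, right) in PDF points.

```r
# Sketch only: these coordinates are placeholders, not measured from GDP.pdf.
# Each area is c(top, left, bottom, right) in PDF points.
table_area <- c(top = 100, left = 40, bottom = 750, right = 570)

# Hypothetical helper: repeat one area for every requested page.
areas_for_pages <- function(area, pages) {
  stopifnot(is.numeric(area), length(area) == 4)
  rep(list(unname(area)), length(pages))
}

pages <- 1:4  # assumed page range; adjust to the actual document
areas <- areas_for_pages(table_area, pages)

# Run the extraction only if tabulizer and a local copy of the PDF are available.
if (requireNamespace("tabulizer", quietly = TRUE) && file.exists("GDP.pdf")) {
  tabs <- tabulizer::extract_tables(
    "GDP.pdf",
    pages = pages,
    area  = areas,
    guess = FALSE  # use the supplied areas instead of auto-detection
  )
  str(tabs)
}
```

If the table sits in the same place on every page, a list of length 1 for area should also work, since the package documentation says it is applied to all specified pages. Whether this yields clean columns for this particular PDF is something I have not verified.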

Jul 28 '17 21:07 scottkosty

An update of Tabula was released last week, which apparently includes a number of fixes and improvements. It is going to require a bit of work to integrate, but I will revisit this once I have the new version working to see if that solves it.