inconsistent behavior of extract_tables and extract_areas

Open datapumpernickel opened this issue 2 years ago • 0 comments

Please specify whether your issue is about:

  • [x] a possible bug

First: Thank you very much for this awesome package. It has saved me tremendous headaches in the past!

Now I am seeing behavior that I cannot wrap my head around. When I call extract_areas() and select the table interactively, the result looks fine: I get back a complete table in the usual format. When I call extract_tables() with the exact same area specified, the result is just list(). I do not understand why one returns the table and the other does not. I would appreciate your input!

Thanks in advance.

Put your code here:

## rJava loads successfully
# install.packages("rJava")
library("rJava")
library("tidyverse")

## load package
library("tabulizer")


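# download the PDF used in this example
httr::GET(
  "https://www.bmwi.de/Redaktion/DE/Publikationen/Aussenwirtschaft/ruestungsexportbericht-2019.pdf?__blob=publicationFile",
  httr::write_disk("temp.pdf")
)

# interactive selection of the table on page 82: returns the complete table
tabulizer::extract_areas("temp.pdf",
                         pages = 82) %>%
  as.data.frame()

# extract_tables() on the same page, no area given
tabulizer::extract_tables("temp.pdf",
                          pages = 82)

# read off the coordinates of the selected area
locate_areas("temp.pdf",
             pages = 82)

# same page with those coordinates passed explicitly: returns list()
tabulizer::extract_tables("temp.pdf",
                          pages = 82,
                          area = list(c(169.78232, 32.63903, 735.16167, 551.83787)))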


## session info for your system
sessionInfo()

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rJava_1.0-4       rvest_1.0.1       jsonlite_1.7.2    httr_1.4.2        shiny_1.7.0       pdftools_3.0.1    tabulizer_0.2.2  
 [8] SWPcdR_0.0.0.9000 extrafont_0.17    janitor_2.1.0     forcats_0.5.1     stringr_1.4.0     dplyr_1.0.7       purrr_0.3.4      
[15] readr_2.0.1       tidyr_1.1.4       tibble_3.1.4      ggplot2_3.3.5     tidyverse_1.3.1   pacman_0.5.1     

loaded via a namespace (and not attached):
 [1] fs_1.5.0            sf_1.0-2            lubridate_1.7.10    tools_4.1.1         padr_0.6.0          backports_1.2.1    
 [7] bslib_0.3.0         utf8_1.2.2          R6_2.5.1            KernSmooth_2.23-20  DBI_1.1.1           colorspace_2.0-2   
[13] withr_2.4.3         sp_1.4-5            tidyselect_1.1.1    curl_4.3.2          compiler_4.1.1      extrafontdb_1.0    
[19] cli_3.0.1           xml2_1.3.2          sass_0.4.0          scales_1.1.1        classInt_0.4-3      proxy_0.4-26       
[25] askpass_1.1         digest_0.6.27       pkgconfig_2.0.3     htmltools_0.5.2     dbplyr_2.1.1        fastmap_1.1.0      
[31] rlang_0.4.11        readxl_1.3.1        rstudioapi_0.13     jquerylib_0.1.4     generics_0.1.1      magrittr_2.0.1     
[37] Rcpp_1.0.7          munsell_0.5.0       fansi_0.5.0         lifecycle_1.0.1     stringi_1.7.4       snakecase_0.11.0   
[43] grid_4.1.1          promises_1.2.0.1    crayon_1.4.2        miniUI_0.1.1.1      lattice_0.20-44     haven_2.4.3        
[49] hms_1.1.1           pillar_1.6.4        reprex_2.0.1        glue_1.4.2          qpdf_1.1            modelr_0.1.8       
[55] tabulizerjars_1.0.1 selectr_0.4-2       png_0.1-7           vctrs_0.3.8         tzdb_0.1.2          httpuv_1.6.3       
[61] Rttf2pt1_1.3.9      cellranger_1.1.0    gtable_0.3.0        assertthat_0.2.1    cachem_1.0.6        mime_0.12          
[67] xtable_1.8-4        broom_0.7.10        countrycode_1.3.0   e1071_1.7-8         rnaturalearth_0.1.0 later_1.3.0        
[73] class_7.3-19        giscoR_0.2.4        units_0.7-2         writexl_1.4.0       ellipsis_0.3.2   

datapumpernickel, Mar 04 '22 13:03