Strange Behavior Page 3 Of This PDF
Please specify whether your issue is about:
- [ ] a possible bug
- [x] a question about package functionality
- [ ] a suggested code or documentation change, improvement to the code, or feature request
I'm processing a 3-page PDF, a public document that reports daily ballot returns. I identified the areas to process with `locate_areas()`; pages 1-2 work fine, but page 3 behaves very strangely. In columns 15-16, text from the first row of the table (which should be ignored) has somehow appeared. The numerical values are there, but interspersed with stray characters.
I assume the "|" characters in the text fields in the first row of the table are causing the trouble, but I'm wondering why tabulizer is processing that area of the PDF at all.
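One thing I have not ruled out is the `guess` argument of `extract_tables()`, which defaults to `TRUE`. The sketch below is only a possible check, not a confirmed fix: it passes the same area with `guess = FALSE`, on the assumption that this makes tabulizer use the supplied coordinates instead of detecting table regions on its own. `extract_fixed_area` is a hypothetical helper name, not part of tabulizer.

```r
# Sketch only: extract a fixed region with table-location guessing turned off.
# `extract_fixed_area` is a hypothetical helper, not a tabulizer function.
extract_fixed_area <- function(file, page, area) {
  if (!requireNamespace("tabulizer", quietly = TRUE)) {
    stop("the tabulizer package is required")
  }
  stopifnot(is.numeric(area), length(area) == 4)
  tables <- tabulizer::extract_tables(
    file,
    pages = page,
    area  = list(area),  # c(top, left, bottom, right), as reported by locate_areas()
    guess = FALSE        # assumption: keeps tabulizer from detecting tables itself
  )
  # With the default output, each element is a character matrix
  lapply(tables, as.data.frame, stringsAsFactors = FALSE)
}

# Usage (needs Java, rJava, and network access):
# url <- "https://sos.oregon.gov/elections/Documents/unofficial-ballot-return-nov-2018.pdf"
# party_returns <- extract_fixed_area(url, 3, c(59, 36, 525, 756))[[1]]
```

If columns 15 and 16 still pick up first-row text with guessing disabled, the cause is presumably somewhere else, for example in how the text in that row is positioned inside the PDF.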
Put your code here:
## rJava loads successfully
# install.packages("rJava")
library("rJava")
## load package
library("tabulizer")
## code goes here
# Extract daily Oregon ballot returns from Secretary of State reports
# URL for Oregon daily ballot returns
or <- "https://sos.oregon.gov/elections/Documents/unofficial-ballot-return-nov-2018.pdf"
# Results from manually using `locate_areas(or)` to process desired tables
#[[1]]
#top left bottom right
#174.13308 86.80617 740.82932 524.43009
#[[2]]
#top left bottom right
#67.27869 54.29508 498.09836 736.52459
#[[3]]
#top left bottom right
#59.01639 36.59016 524.06557 755.40984
ballot_returned <- data.frame(extract_tables(or, pages = 1, area = list(c(174, 87, 730, 524 ))))
daily_returns <- data.frame(extract_tables(or, pages = 2, area = list(c(67, 54, 502, 737))))
party_returns <- data.frame(extract_tables(or, pages = 3, area = list(c(59, 36, 525, 756))))
# Strange results in columns 15 and 16, somehow capturing information from the first row
## session info for your system
sessionInfo()
Session Info:
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.1 tools_3.5.1 yaml_2.2.0
Portion of the problematic page. The areas identified should skip the first row.

Output of `View(party_returns)`: columns X15 and X16 contain stray characters from the first row.
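As a stopgap, since the numerical values are still present in those columns, the stray characters could be stripped after extraction. This is a minimal sketch that assumes the unwanted characters are never digits; if any first-row text contains digits, the result would be wrong. `clean_count` is a hypothetical helper and the sample strings are made up for illustration, not taken from the PDF.

```r
# Sketch only: recover numbers from cells polluted with stray characters.
# `clean_count` is a hypothetical helper; the sample input below is invented.
clean_count <- function(x) {
  digits <- gsub("[^0-9]", "", as.character(x))  # keep digits only
  digits[!nzchar(digits)] <- NA_character_        # cells with no digits become NA
  as.numeric(digits)
}

clean_count(c("1,234", "|5 6|7", "Total", "89"))
# 1234  567   NA   89

# Possible use on the affected columns:
# party_returns$X15 <- clean_count(party_returns$X15)
# party_returns$X16 <- clean_count(party_returns$X16)
```

This only cleans up the symptom. It does not explain why text outside the selected area is being read in the first place.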
