
Strange Behavior Page 3 Of This PDF

Open paulgronke opened this issue 7 years ago • 0 comments

Please specify whether your issue is about:

  • [ ] a possible bug
  • [x] a question about package functionality
  • [ ] a suggested code or documentation change, improvement to the code, or feature request

I am processing a 3-page PDF, a public document that reports daily ballot returns. I identified the areas to process with `locate_areas()`. Pages 1-2 work fine, but page 3 behaves very strangely: in columns 15-16, text from the first row of the table (which should be ignored) has somehow appeared. The numerical values are there, but they are interspersed with stray characters.

I assume the "|" characters in the text fields in the first row of the table are causing the trouble, but I'm wondering why tabulizer is processing that area of the PDF at all.

Put your code here:

## rJava loads successfully
# install.packages("rJava")
library("rJava")

## load package
library("tabulizer")

## code goes here
# Extract daily Oregon ballot returns from Secretary of State reports

# URL for Oregon daily ballot returns
or <- "https://sos.oregon.gov/elections/Documents/unofficial-ballot-return-nov-2018.pdf"

# Results from manually using `locate_areas(or)` to process desired tables

#[[1]]
#top      left    bottom     right 
#174.13308  86.80617 740.82932 524.43009 

#[[2]]
#top      left    bottom     right 
#67.27869  54.29508 498.09836 736.52459 

#[[3]]
#top      left    bottom     right 
#59.01639  36.59016 524.06557 755.40984 


ballot_returned <- data.frame(extract_tables(or, pages = 1, area = list(c(174, 87, 730, 524))))

daily_returns <- data.frame(extract_tables(or, pages = 2, area = list(c(67, 54, 502, 737))))

party_returns <- data.frame(extract_tables(or, pages = 3, area = list(c(59, 36, 525, 756))))

# Strange results in columns 15 and 16, somehow capturing information from the first row

## session info for your system
sessionInfo()

Session Info:

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.1 tools_3.5.1    yaml_2.2.0

Portion of the problematic page. The areas identified should skip the first row.

screen shot 2018-10-31 at 10 47 57 am
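
One untested idea, offered as a guess rather than a diagnosis: `extract_tables()` has a `guess` argument that defaults to `TRUE`, and its documentation says to specify only one of `area`, `columns`, or `guess`. If automatic table detection is running alongside my area, that might be why the first row gets read. The sketch below wraps the call with `guess = FALSE`. The helper name `extract_fixed_area` is my own and not part of tabulizer; the page 3 coordinates in the usage line are the rounded ones from `locate_areas()` above.

```r
# Hypothetical helper (not part of tabulizer): extract one page using only
# the supplied area, with automatic table detection switched off.
extract_fixed_area <- function(file, page, area) {
  tabulizer::extract_tables(
    file,
    pages = page,
    area = list(area),     # c(top, left, bottom, right)
    guess = FALSE,         # assumption: the default guess = TRUE may interfere with `area`
    output = "data.frame"  # returns a list of data frames, one per table
  )[[1]]
}

# Usage (needs rJava, tabulizer, and network access, so left commented out):
# party_returns <- extract_fixed_area(or, page = 3, area = c(59, 36, 525, 756))
```

I have not confirmed that this changes the result for page 3, so treat it as something to try, not a fix.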

In `View(party_returns)`, columns X15 and X16 contain stray characters from the first row.

screen shot 2018-10-31 at 10 48 08 am
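
To make the problem visible without the screenshot, a small base R check can list which cells in a column contain anything other than digits, commas, periods, minus signs, or whitespace. This is only a sketch: the `has_stray_chars` helper and the `demo` data frame are made up for illustration and are not the real contents of `party_returns`.

```r
# Hypothetical helper: TRUE for cells holding characters other than
# digits, commas, periods, minus signs, or whitespace.
has_stray_chars <- function(x) {
  grepl("[^0-9,.[:space:]-]", as.character(x))
}

# Made-up example data, NOT the real extraction output.
demo <- data.frame(
  X15 = c("1,234", "Dem | 567", "89"),
  X16 = c("12", "34", "Rep|56"),
  stringsAsFactors = FALSE
)

# Row indices of the suspicious cells in each column.
stray_rows <- lapply(demo, function(col) which(has_stray_chars(col)))
print(stray_rows)
```

On the real data the same check would be `lapply(party_returns[c("X15", "X16")], function(col) which(has_stray_chars(col)))`, assuming the columns are named X15 and X16 as they appear in the viewer.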

paulgronke, Oct 31 '18 17:10