tabulapdf
PDF ideas for examples
- Scientific papers often have tables, and one would surely like to use the area argument.
- Bus timetables, e.g. http://www.apsrtc.gov.in/Airport%20Liner%20Timings.pdf or http://www.morbihan.fr/fileadmin/Les_services/Vos_deplacements/Transports_collectifs/Fiches_horaires_TIM/TIM7-Hiver-Printemps-2016.pdf p.3
The area argument is available. For example:
extract_tables('Lap Analysis.pdf', pages=8, guess=F, area=list(c(178, 10, 550, 50)))
The area parameter appears to take co-ordinates in the form: top, left, width, height.
You can find the necessary co-ordinates using the tabula app: if you select an area and preview the data, the selected co-ordinates are viewable in the browser developer tools console area.
However, the tabula app console output gives co-ordinates in the form: top, left, bottom, right, so you need to do some sums to convert these numbers into the form that the tabulizer area parameter wants.
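For anyone who wants to script those sums, here is a minimal sketch in base R. It assumes the behaviour observed above (area taken as top, left, width, height), which the next reply describes as a bug due to be fixed, so it would only be needed on a version that still behaves that way. The helper name `tabula_to_width_height` and the input co-ordinates are made up for illustration and are not part of tabulizer.

```r
# Hypothetical helper (not part of tabulizer): convert a Tabula-style
# selection (top, left, bottom, right) into (top, left, width, height).
tabula_to_width_height <- function(sel) {
  stopifnot(is.numeric(sel), length(sel) == 4)
  sel <- unname(sel)
  top <- sel[1]
  left <- sel[2]
  bottom <- sel[3]
  right <- sel[4]
  c(top = top, left = left, width = right - left, height = bottom - top)
}

# Made-up Tabula console co-ordinates: top, left, bottom, right
tabula_to_width_height(c(178, 10, 228, 560))
# gives top = 178, left = 10, width = 550, height = 50
```

The result could then be wrapped in `list()` and passed as the area argument, as in the call shown earlier.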
@psychemedia The area specification is a bug in my code. I'm pushing a fix for it right now. It should be top,left,bottom,right just like in Tabula.
@leeper new terrible example, http://photos.state.gov/libraries/india/231771/PDFs/jan-dec_2015.pdf (the csv here being incomplete). It's US data, 187 pages, I'll report tomorrow once I've scraped it. Have I already said your pkg is awesome? :grin:
I have used the tabulizer package here https://github.com/masalmon/usaqmindia/blob/master/inst/pm25_consulate.R but it's a pretty boring example.
This IRS document might work well as an example: https://www.irs.gov/pub/irs-soi/14databk.pdf
> extract_areas(tmp, pages = c(14, 15, 17, 18), method = "data.frame")
> str(.Last.value)
List of 6
$ :'data.frame': 54 obs. of 8 variables:
..$ X : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
..$ X.1.: chr [1:54] "239,874,741 " "3,074,293 " "584,480 " "4,485,975 " ...
..$ X.2.: chr [1:54] "2,220,921 " "17,613 " "3,362 " "33,844 " ...
..$ X.3.: chr [1:54] "4,642,817 " "50,438 " "9,160 " "83,945 " ...
..$ X.4.: chr [1:54] "3,799,428 " "45,905 " "7,383 " "84,956 " ...
..$ X.5.: chr [1:54] "147,444,789 " "2,048,463 " "357,733 " "2,805,861 " ...
..$ X.6.: chr [1:54] "23,608,340 " "252,431 " "47,482 " "430,138 " ...
..$ X.7.: chr [1:54] "3,205,595" "29,602" "4,178" "49,609" ...
$ :'data.frame': 54 obs. of 8 variables:
..$ X : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
..$ X.8. : chr [1:54] "617,649 " "5,365 " "1,067 " "7,563 " ...
..$ X.9. : chr [1:54] "30,065,749 " "353,564 " "79,939 " "508,257 " ...
..$ X.10.: chr [1:54] "34,132 " "255 " "38 " "410 " ...
..$ X.11.: chr [1:54] "334,641 " "3,163 " "567 " "4,626 " ...
..$ X.12.: chr [1:54] "987,238 " "15,016 " "3,433 " "9,225 " ...
..$ X.13.: chr [1:54] "1,467,402 " "16,792 " "4,682 " "19,344 " ...
..$ X.14.: chr [1:54] "21,446,040" "235,686" "65,456" "448,197" ...
$ :'data.frame': 54 obs. of 7 variables:
..$ X : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
..$ X.1.: chr [1:54] "157,187,971 " "2,122,412 " "371,057 " "2,939,657 " ...
..$ X.2.: chr [1:54] "1,173,505 " "10,456 " "1,524 " "12,059 " ...
..$ X.3.: chr [1:54] "3,439,645 " "40,500 " "6,851 " "50,573 " ...
..$ X.4.: chr [1:54] "2,813,102 " "36,809 " "5,205 " "49,203 " ...
..$ X.5.: chr [1:54] "124,585,594 " "1,785,868 " "301,830 " "2,339,074 " ...
..$ X.6.: chr [1:54] "47,309,667" "612,321" "151,349" "977,840" ...
$ :'data.frame': 54 obs. of 8 variables:
..$ X : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
..$ X.7. : chr [1:54] "3,261,248 " "39,515 " "6,909 " "64,940 " ...
..$ X.8. : chr [1:54] "77,275,927 " "1,173,547 " "150,481 " "1,361,234 " ...
..$ X.9. : chr [1:54] "2,334,249 " "21,674 " "2,840 " "35,140 " ...
..$ X.10.: chr [1:54] "9,615,578 " "66,424 " "11,088 " "186,577 " ...
..$ X.11.: chr [1:54] "253,158 " "4,431 " "258 " "2,748 " ...
..$ X.12.: chr [1:54] "837,997 " "11,547 " "2,966 " "11,454 " ...
..$ X.13.: chr [1:54] "12,135,143" "144,703" "38,495" "252,829" ...
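Every column in the output above comes back as character, with thousands separators and trailing spaces. A rough base R sketch of one way to turn those into numbers follows; `parse_counts` and `clean_table` are hypothetical names used only for illustration, and the sketch assumes the first column of each table holds the state names.

```r
# Hypothetical helper: trim whitespace, drop thousands separators, convert.
parse_counts <- function(x) {
  as.numeric(gsub(",", "", trimws(x), fixed = TRUE))
}

# Values copied from the first table in the output above
parse_counts(c("239,874,741 ", "3,074,293 ", "584,480 "))

# Convert every column except the first (assumed to be the state names)
clean_table <- function(df) {
  df[-1] <- lapply(df[-1], parse_counts)
  df
}
```

With the extracted list stored in a variable, `lapply(tables, clean_table)` would apply the same cleanup to each table.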
I like that it's called data book, hehe.
BTW do you think there would be a way to automatically recognize all tables in a PDF?
@masalmon The default behavior of extract_tables() should do this, as long as guess = TRUE.
Ah cool -- sorry I had missed that.