tabulapdf
locate areas / extract tables renders unexpected results
I have a small issue with the way locate_areas and extract_tables interact.
I use something like:
areas_to_extract <- locate_areas(PDFfile)
extract_tables(PDFfile, area = areas_to_extract)
areas_to_extract is a list with one element per page. Elements for pages where I specified an area contain the coordinates, while elements for pages where I did not indicate an area are left empty.
When the generated list is passed to extract_tables, the empty positions invoke the autodetection algorithm, which tries to find tables on those pages. This seems illogical to me, since I had already reviewed these pages manually to make sure that they do not contain tables.
A possible solution would be for extract_tables to skip a page when no area is indicated for it, so that autodetection is not triggered. I think this would improve efficiency and consistency, and it should be fairly easy to implement.
You want to use the pages argument here. A NULL entry in area does not mean "skip the page"; it means "apply autodetection to the whole page" (rather than to the subset of it specified in area). If you want to skip a page entirely, you need to leave it out of pages, so you might try something like this:
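# Indices of the pages for which an area was selected
w <- which(!sapply(areas_to_extract, is.null))
# Extract only those pages, passing the matching areas
extract_tables(PDFfile, pages = w, area = areas_to_extract[w])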
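To check the indexing without opening a PDF, here is a minimal sketch that only exercises the filtering step. The list is a made-up stand-in for what locate_areas might return on a four-page file, assuming (as described above) that pages without a selection come back as NULL; the coordinate values and the names pages_to_read and areas_to_read are hypothetical.

```r
# Hypothetical stand-in for locate_areas() output on a 4-page PDF:
# pages 2 and 4 have a selected area, pages 1 and 3 were left empty (NULL).
areas_to_extract <- list(
  NULL,
  c(top = 100, left = 50, bottom = 400, right = 550),
  NULL,
  c(top = 120, left = 60, bottom = 300, right = 500)
)

# Indices of the pages that actually have an area.
# vapply() is used instead of sapply() so the result is always logical.
w <- which(!vapply(areas_to_extract, is.null, logical(1)))

# These two objects are what would be passed as pages and area.
pages_to_read <- w
areas_to_read <- areas_to_extract[w]

print(pages_to_read)
str(areas_to_read)
```

Because pages_to_read and areas_to_read have the same length and the same order, each remaining page is paired with its own area, and the pages left out of pages are never looked at.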