tabulapdf

locate areas / extract tables renders unexpected results

Open jkeuskamp opened this issue 7 years ago • 1 comment

I have a small issue with the way locate_areas and extract_tables interact.

I use something like:

areas_to_extract <- locate_areas(PDFfile)
extract_tables(PDFfile, area = areas_to_extract)

areas_to_extract is a list with one element per page of the PDF, each position representing a page. Positions for pages where I selected an area contain the coordinates, while positions for pages where I did not indicate an area are left empty.

When the generated list is passed to extract_tables, the empty positions invoke the autodetection algorithm, which then tries to find tables on those pages. This seems rather illogical to me, as I had already reviewed these pages manually to make sure that they do not in fact contain tables.

A possible solution may be for extract_tables to skip any page for which no area is indicated, so that autodetection is not triggered. I think this would improve efficiency and consistency, and it should be fairly easy to implement.

jkeuskamp, Apr 16 '17 11:04

You want to use the pages argument here. A NULL entry in area does not mean "skip this page"; it means "apply autodetection to the whole page" (rather than to the subset of it specified in area). If you want to skip a page entirely, you need to leave it out of pages, so you might try something like this:

w <- which(!sapply(areas_to_extract, is.null))
extract_tables(PDFfile, pages = w, area = areas_to_extract[w])

leeper, Apr 16 '17 14:04
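
To see what the filtering step in the reply does without needing a PDF, here is a minimal sketch that uses a hand-built list in place of the output of locate_areas. The four-page layout and the coordinate values are made up for illustration, and it assumes that pages without a selected area come back as NULL, as the thread describes. vapply is used instead of sapply only because it guarantees a logical result; it is not taken from the thread.

```r
# Hypothetical stand-in for locate_areas() output on a 4-page PDF:
# pages 2 and 4 have a selected area, pages 1 and 3 were left empty (NULL).
# The coordinate values are invented for this example.
areas_to_extract <- list(
  NULL,
  c(top = 100, left = 50, bottom = 400, right = 500),
  NULL,
  c(top = 120, left = 60, bottom = 380, right = 480)
)

# Indices of the pages that actually have an area.
w <- which(!vapply(areas_to_extract, is.null, logical(1)))

# Keep only the matching area entries, in the same order as the page indices.
areas_to_use <- areas_to_extract[w]

print(w)
print(length(areas_to_use))

# With a real file, the call from the reply would then be:
# extract_tables(PDFfile, pages = w, area = areas_to_use)
```

Because pages and area are subset with the same index vector, each remaining page stays paired with its own area, and the pages with no area are never passed to extract_tables at all.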