tabulapdf
Extract Area around a matching string
Hello, I have tables in the format
abc : 12345566 cde : 456782 gef : 45345435
where the labels (abc, cde, gef) stay the same and the numbers vary. When I extract a specific area, I get a data frame with 2 columns, which is perfect. My problem, however, is that the tables sometimes split over two pages, depending on the extra lines on the number side, and there is one value, "xyz", which is present only for some tables.
Is there a way to get the area around a search string? That way I would know at which value the table got split onto the second page, and if "xyz" is present, I could change the area accordingly.
Hopefully I am making sense...
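One possible way to approach the question above is to look up the position of the search string first and build the area from it. The sketch below is only an illustration: it assumes word positions are available as a data frame with columns text, x, y, width and height (the shape that pdftools::pdf_data(fn)[[page]] returns), and that those coordinates use the same units as the area argument. The helper name area_from_string, the sample word boxes, and the 194/300 sizes are made up for the example.

```r
# Word boxes in the shape returned by pdftools::pdf_data(fn)[[page]]
# (columns x, y, width, height, text). These values are made up.
words <- data.frame(
  text   = c("abc", ":", "12345566", "xyz", ":", "999"),
  x      = c(50, 80, 90, 50, 80, 90),
  y      = c(100, 100, 100, 294, 294, 294),
  width  = c(25, 5, 60, 25, 5, 30),
  height = c(10, 10, 10, 10, 10, 10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build c(top, left, bottom, right) starting at the
# first occurrence of `needle`; returns NULL when the string is absent.
area_from_string <- function(words, needle, height, width) {
  hit <- words[words$text == needle, , drop = FALSE]
  if (nrow(hit) == 0) return(NULL)
  top  <- hit$y[1]
  left <- hit$x[1]
  c(top = top, left = left, bottom = top + height, right = left + width)
}

# Area anchored at the "abc" label, and a flag for whether "xyz" appears.
area_abc <- area_from_string(words, "abc", height = 194, width = 300)
has_xyz  <- !is.null(area_from_string(words, "xyz", height = 194, width = 300))
```

The resulting vector could then be passed as area = list(area_abc) together with guess = FALSE, and has_xyz could be used to decide whether the area needs to be enlarged.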
I tried the suggestion by @leeper on another thread of giving the same page number three times and specifying the areas. The first two tables were extracted perfectly; however, the third table, which flows onto the next page, gave an error:
org_affiliates_3 <- as.data.frame(extract_tables(
  fn,
  pages = c(sel_page, sel_page, sel_page),
  area = c(
    list(c(top3, left3, bottom3, right3)),
    list(c(top3 + 194, left3, bottom3 + 194, right3)),
    list(c(top3 + 194 + 194, left3, bottom3 + 194 + 194, right3))
  ),
  guess = FALSE,
  method = "data.frame"
))
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 10, 6
Is there a way to just bypass the error and extract only 6 rows for the third table?
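The error message appears to come from the outer as.data.frame() call, which tries to bind the three extracted tables side by side even though they have different numbers of rows. A minimal sketch of that behaviour, using made-up stand-in tables instead of real extract_tables() output:

```r
# Stand-ins for three extracted tables; the third is shorter because the
# rest of it continues on the next page. Values are made up.
tabs <- list(
  data.frame(key = letters[1:10], value = 1:10),
  data.frame(key = letters[1:10], value = 11:20),
  data.frame(key = letters[1:6],  value = 21:26)
)

# Column-binding tables of unequal length fails with
# "arguments imply differing number of rows".
res <- try(as.data.frame(tabs), silent = TRUE)
failed <- inherits(res, "try-error")

# Keeping the result as a list lets each table be used on its own.
third <- tabs[[3]]
n_third <- nrow(third)
```

Under that assumption, dropping the as.data.frame() wrapper and indexing the returned list (for example with [[3]]) may be enough to get the 6-row table.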
If you use a different method argument, you should be able to get back a structure that isn't passed through data.frame(). You might try the method = "csv" option, which will just save a CSV locally (directly from the underlying Java code), or method = "character".
Thank you @leeper. I tried all the methods, but none of them seem to work:
extract_tables(fn, pages = 7, package = "tabulizer", method = "csv")
[1] "C:\tabulizer\phd\Ind"
There is no file created in the folder.
extract_tables(fn, pages = 7, package = "tabulizer", method = "character")
list()
extract_tables(fn, pages = 7, package = "tabulizer", method = "asis")
[1] "Java-Object{[]}"
extract_tables(fn, pages = 7, package = "tabulizer", method = "tsv")
[1] "C:\tabulizer\phd\Ind"
No file created.
extract_tables(fn, pages = 7, package = "tabulizer", method = "json")
[1] "C:\tabulizer\phd\Ind"