tabula-java Extract 'all' pages

Hello, have you noticed any difference between using the 'all' pages option and calling tabula for each page separately? I had a case where table extraction delivered better results when I iterated through the pages and called tabula once per page. With the 'all' pages option, one table wasn't extracted properly, but calling tabula with that page number delivered the table data. Iterating through all pages and calling tabula each time is significantly slower than calling tabula once with 'all' pages. Can you guys run a couple of tests?

Feb 13 '17 08:02 klevismino

@klevismino can you send us the document where you noticed this problem, so we can test?

Feb 13 '17 16:02 jeremybmerrill

@jeremybmerrill Thank you for your response. I cannot share that document, but I will generate some sample PDFs and I will share the results with you.

Feb 14 '17 12:02 klevismino

Hello @jeremybmerrill, sorry for the delay. Below you can find an example PDF file. Running

java -jar tabula-0.9.2-jar-with-dependencies.jar a.pdf -p 2 -o a_2.csv

produces good results, but running

java -jar tabula-0.9.2-jar-with-dependencies.jar a.pdf -p all -o a_all.csv

doesn't extract the second page correctly. Do you know why this happens? File: a.pdf

Feb 20 '17 16:02 klevismino

I can replicate. Not sure why this is happening.

Feb 26 '17 17:02 jeremybmerrill

Any news? I think that it has something to do with the page orientation (landscape).

Mar 02 '17 15:03 klevismino

@klevismino: I don't have an update. Next time I dive into the tabula-java source, I'll take a look. Or, @jazzido, @melisabok, if either of you is working on something similar to this and has time.

Mar 16 '17 18:03 jeremybmerrill

What I could see so far is this: when tabula decides which extraction algorithm to use, it makes different decisions depending on the -p parameter.

If -p is 2, the extraction algorithm is Basic, and it works.
If -p is all, the extraction algorithm is Spreadsheet, and it does not extract the table on page 2 correctly.

@jazzido this is because the method SpreadsheetExtractionAlgorithm.isTabular returns different values for the same page: false when I extract only page 2, and true when it iterates over all the pages. I couldn't figure out why. Is the page instance different when we iterate through all the pages than when we get a specific one?

@klevismino, if you run tabula with the -t parameter, you can force it to use only the Basic extraction algorithm, which works for this file.

java -jar target/tabula-0.9.2-jar-with-dependencies.jar a.pdf -p all -o all.csv -t

Expected result:

"",
4. PORTFOLIO VALUATION REPORT,
"",
"",
Enclosed in the following pages is a valuation report of the investments as at 27 March 2016. The,
valuation is conducted using the valuation methodology described in the report and has not been,
audited by an external firm of auditors.,
Page | 24,ABC Fund
SOMETHING,ABC Fund (No. 2)
"",For current quarter
ABC One or other Fund,,,,,,,,,
My Summary,,,,,,,,,
"March 27, 2016",,,,,,,,,
"",,,,,,,,,
"",,,,,,,,,
"",Investment,,Cost,Received,Value,Total Value,,,Period
Country,,Date,US$m,US$m,US$m,US$m,x Money,IRR,(mths)
Unrealised A (Held at Fair Value),,,,,,,,,
John Doe International Myland,Sep 2011,,123.3,123.7,123.0,123.7,1.11x,-9%,49
ABC Global Limited Yourland,Oct 2011,,123.7,-,123.4,123.4,1.11x,9%,48
Smith Pty Ltd Myland,Oct 2011,,123.4,123.9,123.1,123.0,1.11x,9%,47
"(1)",,,,,,,,,
XYZ Ltd Herland,Jul 2010,,123.0,-,123.0,123.0,1.xxx,0%,48
Grand Total,,,"1,123.0",123.6,"1,123.4","1,123.9",5.xxx,50%,47
"(2)",,,,,,,,,
Note:,,,,,,,,,
"(1) Valued at  cost  within the first  12 months of investment",,,,,,,,,
"(2) Net of fees and  expenses",,,,,,,,,

Mar 19 '17 01:03 melisabok
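
For readers following the thread, the selection behaviour described in the last comment can be pictured with the rough sketch below. It is a simplified stand-in, not tabula-java source: the class `ExtractorChoice`, the enums `Algorithm` and `Mode`, and the method `choose` are hypothetical names made up for illustration, and the predicate merely stands in for the result of SpreadsheetExtractionAlgorithm.isTabular on a page. It only shows why a different isTabular answer for the same page leads to a different algorithm, and why forcing Basic sidesteps the check.

```java
import java.util.function.Predicate;

// Simplified model of the selection step discussed in the thread.
// All names here are hypothetical; this is NOT tabula-java code.
final class ExtractorChoice {
    // The algorithm finally used for a page.
    enum Algorithm { BASIC, SPREADSHEET }

    // DECIDE = no forcing flag given; FORCE_BASIC = the -t flag was passed.
    enum Mode { DECIDE, FORCE_BASIC }

    // isTabular stands in for the value SpreadsheetExtractionAlgorithm.isTabular
    // is reported to return for the page.
    static <P> Algorithm choose(Mode mode, P page, Predicate<P> isTabular) {
        if (mode == Mode.FORCE_BASIC) {
            return Algorithm.BASIC; // the tabular check is never consulted
        }
        return isTabular.test(page) ? Algorithm.SPREADSHEET : Algorithm.BASIC;
    }

    public static void main(String[] args) {
        // The outcomes reported in the thread for page 2 of a.pdf.
        System.out.println("-p 2:      " + choose(Mode.DECIDE, "page 2", p -> false));
        System.out.println("-p all:    " + choose(Mode.DECIDE, "page 2", p -> true));
        System.out.println("-p all -t: " + choose(Mode.FORCE_BASIC, "page 2", p -> true));
    }
}
```

The model says nothing about why isTabular disagrees between the two runs; that question is still open in the thread.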