Detecting Text Data as Table while working with Java.

Open urmay opened this issue 6 years ago • 1 comments

When I pass only the first page as a command-line argument, it correctly picks out the table from the text on that page. But when I pass the whole document, it also detects plain text as a table. Version: tabula-1.0.2.jar

Java Code:

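// Imports needed by the snippet below.
import java.io.File;

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.ParseException;

import technology.tabula.CommandLineApp;

public static void main(String[] args) throws ParseException
{
	// Whole document: plain text is also reported as tables.
	// String commandLineOptions[] = {"-p", "all", "-o", "$tsv"};
	// First page only: the table is detected correctly.
	String commandLineOptions[] = {"-p", "1", "-o", "$tsv"};

	CommandLineParser parser = new DefaultParser();
	try
	{
		// buildOptions() is the static option builder on CommandLineApp.
		CommandLine line = parser.parse(CommandLineApp.buildOptions(), commandLineOptions);
		new CommandLineApp(System.out, line).extractFileInto(
				new File("C:/Users/path to pdf/ast_sci_data_tables_sample.pdf"),
				new File("C:/Users/path to pdf/ast_sci_data_tables_sample.tsv"));
	}
	catch (Exception e)
	{
		e.printStackTrace();
	}
}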

Surprisingly, when I run the same file through the Python library, it detects only the tables across the whole PDF.

Python Code:

from tabula import convert_into, read_pdf

df = read_pdf("C:/Users/path to pdf/ast_sci_data_tables_sample.pdf", multiple_tables=True, pages="all")
convert_into("C:/Users/path to pdf/ast_sci_data_tables_sample.pdf", "test.json", output_format="json", multiple_tables=True, pages="all")

PDF file: http://www.sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf

May 07 '19 13:05 urmay

Use the Nurminen detection algorithm to detect only the tables; then you can use BasicExtractionAlgorithm to extract the data into the required format, such as CSV or HTML.
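
A minimal sketch of the suggestion above, assuming tabula-java 1.0.2 and its PDFBox 2.x dependency are on the classpath. The class name DetectThenExtract and reading the PDF path from args[0] are illustrative choices, not something from the original post. The sketch runs NurminenDetectionAlgorithm on each page to get candidate table rectangles, then applies BasicExtractionAlgorithm only inside those rectangles, so text outside a detected area is never extracted:

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;

import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.PageIterator;
import technology.tabula.Rectangle;
import technology.tabula.Table;
import technology.tabula.detectors.NurminenDetectionAlgorithm;
import technology.tabula.extractors.BasicExtractionAlgorithm;
import technology.tabula.writers.CSVWriter;

// Hypothetical class name, used only for this illustration.
public class DetectThenExtract {
    public static void main(String[] args) throws IOException {
        // Assumption: the PDF path is passed as the first argument.
        File pdf = new File(args[0]);

        NurminenDetectionAlgorithm detector = new NurminenDetectionAlgorithm();
        BasicExtractionAlgorithm extractor = new BasicExtractionAlgorithm();
        List<Table> tables = new ArrayList<>();

        try (PDDocument document = PDDocument.load(pdf)) {
            ObjectExtractor objectExtractor = new ObjectExtractor(document);
            PageIterator pages = objectExtractor.extract(); // iterate over all pages
            while (pages.hasNext()) {
                Page page = pages.next();
                // Step 1: detect the rectangles that look like tables on this page.
                for (Rectangle area : detector.detect(page)) {
                    // Step 2: extract only from the detected area, not the whole page.
                    tables.addAll(extractor.extract(page.getArea(area)));
                }
            }
        }

        // Write every extracted table as CSV to standard output.
        new CSVWriter().write(System.out, tables);
    }
}
```

Swapping CSVWriter for TSVWriter or JSONWriter (both in technology.tabula.writers) should change the output format without touching the detection step.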

May 29 '20 10:05 satyaraj479