tabula-java
Use both lattice and columns options
Is your feature request related to a problem? Please describe.
lattice=True does not work with a specific document because the table does not have visible vertical column lines. I'm using area and columns options to specify the portion of the page to consider and the x coordinates for the column boundaries, which works well. However, I think word wrapping in the cells is an issue, and each row is split into multiple rows.
Describe the solution you'd like
I'd like to use the lattice option to allow tabula to detect row boundaries, while also using the columns option to specify where the column boundaries are. I'm open to alternatives that would achieve what I need, though.
Describe alternatives you've considered
I've tried using just the lattice option, which works as designed to detect rows, but combines all columns. I.e. I get an extracted dataframe that is a single column with the correct number of rows. If I use the columns option instead of lattice, I get the right number of columns but a lot of extra rows.
Additional context
Python 3.6.5 Java 1.8.0_181
PDF document I'm working on is here: Projects.pdf
lattice option method:
columns option method:
Let me chime in with more detail, as an author of tabula-py. [ref]
@jscottNRG wants to use both the -c and -l options at once, as follows:
$ java -jar tabula-1.0.2-jar-with-dependencies.jar -a 145,25,695,1195 -c 156,252,364,811,909,1019 -l Projects.pdf
As you may be aware, the table in this PDF only has horizontal lines. Is there any way to use the -c and -l options at the same time?
You both raise an interesting point!
I think this does not exist yet but could exist in principle. We would probably also accept a pull request to do this; there's a remote (but existent) possibility that there might be some funding to work on extraction algorithms that might enable this. (cc @jazzido... also in case I'm missing something)
The way to approach it would likely be to do the "stream" analysis to find the column locations (or accept them as parameters), then add those manually to the spreadsheet object reconstructed by the "lattice" analysis, by constructing a Ruling object for each of them and adding them here. I think it'd be doable... and what's better, I think it'd solve a real problem for a decent subset of PDFs. That is, I think the format you describe is not so idiosyncratic that an extraction algorithm couldn't exist to cope with it.
The way to approach this first would be to implement this algorithm, turned on explicitly by a command-line switch (or option in tabula-py). Eventually we could experiment with heuristics that'd let us guess when to use this extraction method (and thus eventually get it in the GUI), but let's save that for later.
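To make the idea concrete, here is a minimal, library-free sketch of the hybrid approach: row boundaries come from the detected horizontal rulings, column boundaries come from the user-supplied x coordinates, and each word is dropped into the cell that contains it. This is not tabula-java code; the function name `assign_cells`, the `(x, y, text)` word tuples, and the sample coordinates are all assumptions made up for illustration.

```python
from bisect import bisect_right


def assign_cells(words, row_edges, col_edges):
    """Group words into a grid of cells (hypothetical helper, not a tabula API).

    words     -- iterable of (x, y, text) tuples, one per word
    row_edges -- sorted y coordinates of horizontal rulings ("lattice" side)
    col_edges -- sorted x coordinates of column boundaries, including the
                 left and right edges of the area ("columns" side)
    """
    n_rows = len(row_edges) - 1
    n_cols = len(col_edges) - 1
    grid = [[[] for _ in range(n_cols)] for _ in range(n_rows)]

    # Visit words in reading order so wrapped lines are joined correctly.
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        row = bisect_right(row_edges, y) - 1
        col = bisect_right(col_edges, x) - 1
        # Ignore words that fall outside the outermost boundaries.
        if 0 <= row < n_rows and 0 <= col < n_cols:
            grid[row][col].append(text)

    return [[" ".join(cell) for cell in row] for row in grid]


# Made-up sample: "Solar" and "Farm" are two wrapped lines of one cell.
words = [
    (30, 150, "Solar"), (30, 165, "Farm"), (160, 150, "Active"),
    (30, 210, "Wind"), (160, 210, "Planned"),
]
table = assign_cells(words, row_edges=[145, 200, 260], col_edges=[25, 156, 252])
print(table)
```

Because rows are bounded by the rulings rather than by text baselines, a wrapped cell stays in one row, which is the behavior the original request is after.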
Just FYI, PDFPlumber has separate configurations for vertical and horizontal table extraction. I know the implementation of tabula-java could be different, but I guess there is some way to approach this issue.
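For reference, a sketch of what that could look like with pdfplumber: horizontal boundaries are taken from the ruling lines on the page, while vertical boundaries are given explicitly. The helper names `hybrid_table_settings` and `extract_hybrid` are made up for this example, and it assumes that the tabula-style area (top, left, bottom, right) and column x values can be reused directly as pdfplumber coordinates, which may not hold for every document.

```python
def hybrid_table_settings(column_xs, left, right):
    """Build pdfplumber table settings (hypothetical helper).

    Rows are detected from drawn lines; columns are supplied explicitly.
    The outer left/right edges are added so the first and last columns
    are closed.
    """
    return {
        "horizontal_strategy": "lines",
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": [left, *column_xs, right],
    }


def extract_hybrid(pdf_path, area, column_xs, page_index=0):
    """Extract one table from a page region (hypothetical helper).

    area is (top, left, bottom, right), as in tabula's -a option;
    pdfplumber's crop() expects (x0, top, x1, bottom) instead.
    """
    import pdfplumber  # third-party: pip install pdfplumber

    top, left, bottom, right = area
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index].crop((left, top, right, bottom))
        settings = hybrid_table_settings(column_xs, left, right)
        return page.extract_table(settings)


# Settings for the area and columns quoted earlier in this thread.
settings = hybrid_table_settings([156, 252, 364, 811, 909, 1019], 25, 1195)
print(settings)
```

Calling `extract_hybrid("Projects.pdf", (145, 25, 695, 1195), [156, 252, 364, 811, 909, 1019])` would then return the table as a list of rows, assuming the file is available locally.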
Perfect, thanks all. @chezou I tried extracting the tables with PDFPlumber and it worked like a charm.
How can we get the column parameter values for the PDFs?