Manuel Aristarán
@lukehsiao, it's an implementation of a classic technique in document analysis and segmentation. The basic idea is to calculate the vertical and horizontal projections (sum of heights and widths) of...
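For anyone who wants to see the general idea in code, here is a minimal sketch of projection profiles. It is not Tabula's actual implementation. The `Box` layout (integer left, top, width, height) and the function names `projection` and `gaps` are assumptions made for illustration only. It sums box heights per column and box widths per row, then reports the empty runs, which are candidate column and row separators.

```python
from typing import List, Tuple

# Hypothetical box layout: (left, top, width, height) in integer page units.
Box = Tuple[int, int, int, int]


def projection(boxes: List[Box], size: int, axis: str) -> List[int]:
    """Accumulate box extents along one axis.

    axis="x": profile[i] is the summed height of the boxes covering column i.
    axis="y": profile[i] is the summed width of the boxes covering row i.
    """
    profile = [0] * size
    for left, top, width, height in boxes:
        if axis == "x":
            start, length, weight = left, width, height
        else:
            start, length, weight = top, height, width
        # Clip to the page so stray boxes cannot index out of range.
        for i in range(max(start, 0), min(start + length, size)):
            profile[i] += weight
    return profile


def gaps(profile: List[int], min_len: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) runs where the profile is zero."""
    runs = []
    start = None
    # The trailing sentinel closes a zero run that reaches the end.
    for i, value in enumerate(profile + [1]):
        if value == 0 and start is None:
            start = i
        elif value != 0 and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# Four text boxes laid out as a 2x2 grid on a 10x6 page.
boxes = [(0, 0, 4, 2), (6, 0, 4, 2), (0, 4, 4, 2), (6, 4, 4, 2)]
column_gaps = gaps(projection(boxes, 10, "x"))  # empty vertical strips
row_gaps = gaps(projection(boxes, 6, "y"))      # empty horizontal strips
print(column_gaps, row_gaps)
```

On real pages you would likely raise `min_len` so that ordinary gaps between characters or words are not mistaken for column boundaries.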
Thanks, @mhkeller. Can you share that PDF? I'd love to take a look at it.
Also, if you're using Tabula from the command line, you might want to try [tabula-java](https://github.com/tabulapdf/tabula-java) (pure Java, easier to install, ~3x faster). `tabula-extractor` is going to be deprecated soon.
OK, so the output is expected for that PDF. Without visual separators ("ruling lines"), we can't really merge multiline cells. I usually post-process those cases with a script that...
Thanks, @Kanz95. Which version of the Tabula app are you using? The current one (1.2.0) only detects tables in the last two pages as well.
Hi @Kanz95, no. I don't know when I'm going to be able to look at this (Tabula is a side project, and I don't get paid for it). If you're building...
…and contribute back the fix, if you're so inclined.
> Sometimes, there are lines that are not visible to a viewer, but are present in the PDF. That's the case here: there's a white line running across cell B6...
It could also be a subsetted font, which is essentially a non-standard encoding. See [this StackOverflow answer](https://stackoverflow.com/questions/8039423/pdf-data-extraction-gives-symbols-gibberish).
The too-small area comes directly from `tabula-extractor`, so it will also happen in tabulapdf/tabula@master. There's no user selection involved in this issue, so the selection-boundaries-as-rulings feature won't solve it.