Manuel Aristarán

61 comments by Manuel Aristarán

@lukehsiao, it's an implementation of a classic technique in document analysis and segmentation. The basic idea is to calculate the vertical and horizontal projections (sum of heights and widths) of...
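To make the projection idea concrete, here's a minimal Python sketch. It is not Tabula's actual code. It assumes each text element is a hypothetical `(left, top, width, height)` box in integer page units, and the function names (`vertical_projection`, `horizontal_projection`, `gaps`) are made up for illustration. Runs of zeros in a profile are candidate column or row separators.

```python
from typing import List, Tuple

# Hypothetical box type (an assumption): (left, top, width, height), integer page units.
Box = Tuple[int, int, int, int]


def vertical_projection(boxes: List[Box], page_width: int) -> List[int]:
    """For each x position, sum the heights of the boxes that cover it."""
    profile = [0] * page_width
    for left, _top, width, height in boxes:
        for x in range(max(left, 0), min(left + width, page_width)):
            profile[x] += height
    return profile


def horizontal_projection(boxes: List[Box], page_height: int) -> List[int]:
    """For each y position, sum the widths of the boxes that cover it."""
    profile = [0] * page_height
    for _left, top, width, height in boxes:
        for y in range(max(top, 0), min(top + height, page_height)):
            profile[y] += width
    return profile


def gaps(profile: List[int], min_len: int = 1) -> List[Tuple[int, int]]:
    """Return half-open [start, end) runs where the profile is zero."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value == 0 and start is None:
            start = i  # a zero run begins here
        elif value != 0 and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a zero run that extends to the end of the profile.
    if start is not None and len(profile) - start >= min_len:
        runs.append((start, len(profile)))
    return runs
```

For a 2x2 grid of boxes, `gaps(vertical_projection(boxes, page_width))` gives the empty band between the two columns, and the horizontal profile does the same for the rows. A real page would likely need a `min_len` threshold so that ordinary inter-word spacing isn't mistaken for a column break.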

Thanks, @mhkeller. Can you share that PDF? I'd love to take a look at it.

Also, if you're using Tabula from the command line, you might want to try [tabula-java](https://github.com/tabulapdf/tabula-java) (pure Java, easier to install, ~3x faster). `tabula-extractor` is going to be deprecated soon.

OK, so the output is expected for that PDF. Without visual separators ("ruling lines"), we can't really merge multiline cells. I usually post-process those cases with a script that...

Thanks, @Kanz95. Which version of the Tabula app are you using? The current one (1.2.0) only detects tables in the last two pages as well.

Hi @Kanz95, no. I don't know when I'm going to be able to look at this (Tabula is a side project; I don't get paid for it). If you're building...

…and contribute back the fix, if you're so inclined.

> Sometimes, there are lines that are not visible to a viewer, but are present in the PDF. That's the case here: there's a white line running across cell B6...

It could also be a subsetted font, which is essentially a non-standard encoding. See [this StackOverflow answer](https://stackoverflow.com/questions/8039423/pdf-data-extraction-gives-symbols-gibberish).

The too-small area comes directly from `tabula-extractor`, so it will also happen in tabulapdf/tabula@master. There's no user selection involved in this issue, so the selection-boundaries-as-rulings feature won't solve it.