tabula-extractor

Put multi-line cell content into a single cell

Open jpmckinney opened this issue 10 years ago • 6 comments

Out of the test data, the files that don't copy-paste from Preview to Excel cleanly are:

  • bo_page24.pdf
  • gre.pdf
  • vertical_rulings_bug.pdf

On the other hand, Tabula chokes on some PDFs that copy-paste just fine! For example, page 2 of this PDF from this website.

I cooked up an AppleScript to do bulk copy-pasting in these cases: https://github.com/opennorth/copy_paste_pdf

I'm not sure why Tabula has trouble with the linked PDF, but maybe it will be another useful test case.

jpmckinney avatar Oct 10 '13 00:10 jpmckinney

James,

Could you clarify how Tabula "chokes" on that Newfoundland and Labrador PDF? Do you mean that Tabula outputs multi-line cells on different lines, where copy-paste properly includes them on the same line? Or does Tabula completely fail to output anything?

Thanks!

jeremybmerrill avatar Oct 10 '13 15:10 jeremybmerrill

Sure, @jeremybmerrill. Using Tabula from git HEAD, with that PDF, Tabula either:

  1. Moves a cell that should be in the same row as other cells into a new row below
  2. Splits the content of one or more cells in a given row onto multiple rows (either above or below the row with the rest of the cells)

When copy-and-pasting that PDF, the first error never occurs, and the second error occurs less frequently. When it does occur, the rows are split into new rows that appear below, and never above, making it easier to write a script to clean the CSV.

If I do not select the table precisely (and just put a square around the whole page), the CSV is much worse than copy-and-pasting. "Autodetect Tables" gives JavaScript errors, so I couldn't test it.

jpmckinney avatar Oct 10 '13 16:10 jpmckinney

I've put a gist to compare the two CSVs here: https://gist.github.com/jpmckinney/6921697

The copy-and-paste method creates more empty rows, but those are very easy to clean in post-processing. It adds an extra space at the end of each cell, but that is also very easy to clean. The copy-and-paste method is nearly perfect in this case, whereas Tabula requires careful post-processing.
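For reference, the two easy clean-up steps mentioned above (dropping empty rows and stripping the extra trailing space from each cell) could look roughly like this in Python. This is only a sketch; the function names `clean_rows` and `clean_csv_text` are made up for illustration and are not part of Tabula or copy_paste_pdf.

```python
import csv
import io


def clean_rows(rows):
    """Strip surrounding whitespace from every cell and drop fully empty rows."""
    cleaned = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        # Keep the row only if at least one cell still has content.
        if any(cells):
            cleaned.append(cells)
    return cleaned


def clean_csv_text(text):
    """Apply clean_rows to CSV text and return the cleaned CSV text."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(clean_rows(reader))
    return out.getvalue()
```

It does nothing about cells that were split across rows; that still needs case-by-case handling.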

jpmckinney avatar Oct 10 '13 16:10 jpmckinney

Thanks for the additional details, @jpmckinney!

By #1 do you mean like where Tabula puts "Classifications Appeal Board" a line below "Jean Myrick" on row 1 of page 2? Just want to be clear. :)

Assuming so, that's a bug we're aware of (though I can't find the issue here...). It's definitely a big one.

You're right that the post-processing to combine cells is tough. I've written quite a few bespoke scripts to deal with that output for a production project that uses tabula-extractor. They're a pain in the ass, I know... I'm sure the guy who inherited that project from me would be ecstatic if we solved it. But we're not quite there yet, algorithmically.

I think our approach would be to use the line elements on the page to group text elements with different y-locations into a single cell. This may have to wait until https://github.com/jazzido/tabula-extractor/issues/16 is finished, because the more line detection we do via computer vision (our current approach for detecting tables, as opposed to cells), the slower Tabula will be.
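To make the idea concrete, here is a rough Python sketch, not Tabula's actual code. It assumes the rulings have already been detected and reduced to a clean grid (a list of y-coordinates for horizontal lines and x-coordinates for vertical lines), that y grows downward, and that each text element is an `(x, y, text)` tuple. The function name `group_into_cells` and the coordinates in the usage note are invented for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def group_into_cells(text_elements, h_rulings, v_rulings):
    """Assign each (x, y, text) element to the grid cell bounded by the rulings."""
    ys = sorted(h_rulings)  # horizontal rulings: row boundaries
    xs = sorted(v_rulings)  # vertical rulings: column boundaries
    n_rows, n_cols = len(ys) - 1, len(xs) - 1

    cells = defaultdict(list)
    for x, y, text in text_elements:
        row = bisect_right(ys, y) - 1
        col = bisect_right(xs, x) - 1
        # Ignore text that falls outside the ruled grid.
        if 0 <= row < n_rows and 0 <= col < n_cols:
            cells[(row, col)].append((y, x, text))

    table = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for (row, col), items in cells.items():
        # Read each cell top to bottom, then left to right.
        table[row][col] = " ".join(text for _, _, text in sorted(items))
    return table
```

With made-up coordinates, `group_into_cells([(10, 12, "Jean Myrick"), (110, 12, "Classifications"), (110, 24, "Appeal Board")], [0, 40], [0, 100, 200])` puts both lines of the second column into one cell, because they sit between the same pair of horizontal rulings.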

Another approach (ignoring lines) might be to use some sort of heuristic to look at the distances between a cell and the closest non-empty one above it. If the distance is relatively greater, it might be a new row; if it's less, it might be a continuation of the previous cell. This might get gross, though -- and only be successful some of the time.
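A minimal Python sketch of that kind of heuristic, for a single column, might look like the following. It assumes each text line comes as a `(top, bottom, text)` tuple with y growing downward, and it treats "relatively greater" as "gap larger than some fraction of the line's own height". The name `merge_continuations` and the `ratio=0.6` threshold are placeholders, not values anyone has tuned.

```python
def merge_continuations(lines, ratio=0.6):
    """Merge vertically close text lines in one column into single cells.

    lines: iterable of (top, bottom, text) tuples, y growing downward.
    ratio: hypothetical threshold; a gap of at most ratio * line height
           is treated as a continuation of the cell above.
    """
    merged = []
    prev_bottom = None
    for top, bottom, text in sorted(lines):
        height = bottom - top
        gap = None if prev_bottom is None else top - prev_bottom
        if gap is not None and gap <= ratio * height:
            # Small gap: continuation of the previous cell.
            merged[-1] += " " + text
        else:
            # First line, or a large gap: start a new cell.
            merged.append(text)
        prev_bottom = bottom
    return merged
```

A fixed ratio will misfire on tables with tight row spacing or loose line spacing, which is the "only successful some of the time" part.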

Would love to hear your input and we appreciate the test file.

jeremybmerrill avatar Oct 11 '13 14:10 jeremybmerrill

Yup, that's what I mean by point 1.

Yeah, #16 seems to be the solution. I was originally going to hack together a script to find rectangles, tesseract each rectangle, and recompose a table, when I discovered that copy-pasting magically worked (it helps that the PDF was exported from Excel). #16 sounds much more robust!

jpmckinney avatar Oct 11 '13 14:10 jpmckinney

Great. Because PDF is such a shit format, different PDF generators generate radically different structures that represent similar-looking PDFs. Getting perfect coverage is definitely a goal, though there's obviously work still to be done.

You might be intrigued by the [dochive](https://github.com/raleighpublicrecord/dochive) project. I don't know if it's still active, but I know that their aim was to do just what you were thinking for scanned PDFs -- create a template system or use CV to find rectangles, tesseract their contents, and export that as a CSV.

I just renamed the issue, I hope you don't mind.

jeremybmerrill avatar Oct 14 '13 13:10 jeremybmerrill