ocr-table Extracting table data?

Open munikarmanish opened this issue 5 years ago • 5 comments

Right now, it only seems to perform OCR, i.e., convert an image to raw text. Is there any table-specific extraction performed? Basically, I'm researching good algorithms for extracting tabular data from scanned documents.

Thanks in advance. :)

Jan 02 '19 04:01 munikarmanish

Hi, @munikarmanish!

You're correct. The OCR currently only works for pre-processed images.

While it does extract data from PDFs with tables, it currently performs a horizontal scan and doesn't perform any table-based classification on the text yet. I'm still trying to figure out how to make that work.

A make-do solution could be to classify the text after extraction based on the length of the columns, but that will only work if every column has a fixed number of words, which is not the case in most scenarios.
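
To make that idea concrete, here is a rough sketch of what such a post-processing step might look like. It is not code from this project: the function names `split_row` and `rows_from_text` and the `widths` parameter are made up for illustration, and it assumes each column always holds a known, fixed number of words.

```python
from typing import List


def split_row(line: str, widths: List[int]) -> List[str]:
    """Split one line of OCR text into cells.

    widths[i] is the assumed (fixed) number of words in column i.
    """
    words = line.split()
    if len(words) != sum(widths):
        # The fixed-width assumption does not hold for this line.
        raise ValueError(
            f"expected {sum(widths)} words, got {len(words)}: {line!r}"
        )
    cells = []
    start = 0
    for width in widths:
        cells.append(" ".join(words[start:start + width]))
        start += width
    return cells


def rows_from_text(text: str, widths: List[int]) -> List[List[str]]:
    """Apply split_row to every non-empty line of the extracted text."""
    return [split_row(line, widths) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = "John Smith 42 New York\nJane Doe 37 Los Angeles\n"
    for row in rows_from_text(sample, [2, 1, 2]):
        print(row)
```

The `ValueError` shows the limitation: a row such as `Ann Lee 29 Boston` has one word fewer in the last column, so it no longer fits the `[2, 1, 2]` layout and gets rejected.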