ocr-table

Extracting table data?

Open munikarmanish opened this issue 5 years ago • 5 comments

Right now, it only seems to perform OCR, i.e., convert an image to raw text. Is any table-specific extraction performed? I'm researching good algorithms for extracting tabular data from scanned documents.

Thanks in advance. :)

munikarmanish avatar Jan 02 '19 04:01 munikarmanish

Hi, @munikarmanish !

You're correct. The OCR currently only works for pre-processed images.

While it does extract data from PDFs with tables, it currently performs a horizontal scan and doesn't perform any table-based classification on the text yet. I'm still trying to figure out how to make that work.

A make-do solution could be to classify the text after extraction based on the length of the columns, but that will only work if every column has a fixed number of words, which is not the case in most scenarios.
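As a rough illustration of that make-do heuristic (a minimal sketch, not code from this repository), the snippet below splits each OCR'd text line into cells under the assumption that every column holds a fixed number of words. The function name `split_fixed_columns`, the `words_per_column` parameter, and the sample rows are all hypothetical.

```python
def split_fixed_columns(line, words_per_column):
    """Split one OCR'd text line into cells.

    Assumes (hypothetically) that column i always contains exactly
    words_per_column[i] words. Returns None when the line does not
    match that assumption.
    """
    words = line.split()
    if len(words) != sum(words_per_column):
        return None  # the fixed-length assumption does not hold for this line
    cells, start = [], 0
    for count in words_per_column:
        cells.append(" ".join(words[start:start + count]))
        start += count
    return cells


# Made-up sample rows: name (2 words), age (1 word), job (1 word).
rows = ["John Smith 42 Engineer", "Mary Ann Lee 37 Doctor"]
parsed = [split_fixed_columns(row, [2, 1, 1]) for row in rows]
print(parsed)
```

The second row has a three-word name, so it no longer matches the expected word counts and is rejected. That is the limitation described above: the heuristic breaks as soon as any cell varies in length.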

cseas avatar Jan 11 '19 04:01 cseas

The way to do this is to use code to do table detection (columns and rows) and then perform the OCR within the table. It's a really hard problem, though.
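A toy sketch of the detect-then-OCR idea, under strong assumptions: the page is already binarized (1 = ink, 0 = background), the table has no ruling lines, and rows and columns are separated by fully blank strips, so simple projection profiles are enough to find them. Real scans usually need far more robust detection. The names `ink_spans`, `detect_cells`, `ocr_table`, and the `ocr_cell` callback are hypothetical; `ocr_cell` stands in for whatever OCR engine would be run on each cropped cell.

```python
def ink_spans(profile):
    """Return (start, end) index spans where the projection profile is non-zero."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            spans.append((start, i))       # the run ended at a blank strip
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def detect_cells(image):
    """Find cell boxes (top, bottom, left, right), grouped by table row."""
    row_profile = [sum(row) for row in image]         # ink per pixel row
    col_profile = [sum(col) for col in zip(*image)]   # ink per pixel column
    row_spans = ink_spans(row_profile)
    col_spans = ink_spans(col_profile)
    return [[(top, bottom, left, right) for (left, right) in col_spans]
            for (top, bottom) in row_spans]


def ocr_table(image, ocr_cell):
    """Crop every detected cell and pass the crop to the ocr_cell callback."""
    table = []
    for row in detect_cells(image):
        table.append([
            ocr_cell([line[left:right] for line in image[top:bottom]])
            for (top, bottom, left, right) in row
        ])
    return table


# Tiny made-up "page": two table rows and two columns.
image = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
]

# Stand-in for a real OCR call: just count the ink pixels in each cell.
result = ocr_table(image, lambda crop: sum(sum(line) for line in crop))
print(result)
```

Passing the OCR step in as a callback keeps the layout detection separate from the recognition engine, so either part can be swapped out independently.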

aribornstein avatar Feb 12 '19 09:02 aribornstein

Hi @munikarmanish, did you find anything regarding the research you mentioned above?

jaysinghr avatar Jun 27 '19 05:06 jaysinghr

> Hi @munikarmanish, did you find anything regarding the research you mentioned above?

Yes, I've found a few interesting approaches:

munikarmanish avatar Jul 03 '19 04:07 munikarmanish

I am also facing the above issue. Has anyone found the best solution after 2 years?

SAIVENKATARAJU avatar Nov 10 '21 12:11 SAIVENKATARAJU