Is possible: tables?

Open manuelsongokuh opened this issue 8 years ago • 24 comments

Hello, gImageReader is AMAZING! I need one feature:

  • table selection (adding horizontal lines creates ROWS, adding vertical lines creates COLUMNS)

because I want to copy the table to the clipboard and paste it into LibreOffice Calc, OR output it to CSV.

manuelsongokuh avatar Jan 10 '16 20:01 manuelsongokuh

See a similar example of adding lines: link

manuelsongokuh avatar Jan 10 '16 20:01 manuelsongokuh

Hi!

This certainly would be a useful feature. As far as I can see, however, tesseract (the OCR engine used by gImageReader) does not have table extraction support built in. Perhaps you could do some research on whether this is really the case, and if so, whether some algorithms are already available to do the job?

manisandro avatar Jan 11 '16 08:01 manisandro

OK, but I don't think algorithms are needed, because the table lines (horizontal and vertical) just divide the OCR scan area into groups of cells.

Example:

  • 3 vertical lines and 3 horizontal lines = 9 cells, i.e. 9 groups.
  • The 9 groups get ID numbers (A1, A2, A3, B1, B2, B3, C1, C2, C3).
  • OCR scans A1, finishes, and saves the text: XXX>>
  • OCR scans A2, finishes, and saves the text: XXX>>YYY>>
  • OCR scans A3, finishes, and saves the text: XXX>>YYY>>ZZZ
  • OCR scans B1, finishes, and saves the text: XXX>>YYY>>ZZZ##newline##XXX>>
  • OCR scans B2, finishes, and saves the text: XXX>>YYY>>ZZZ##newline##XXX>>YYY>>
  • OCR scans B3, finishes, and saves the text: XXX>>YYY>>ZZZ##newline##XXX>>YYY>>ZZZ
  • OCR scans C1, finishes, and saves the text: XXX>>YYY>>ZZZ##newline##XXX>>YYY>>ZZZ##newline##XXX>>
  • OCR scans C2, finishes, and saves the text: XXX>>YYY>>ZZZ##newline##XXX>>YYY>>ZZZ##newline##XXX>>YYY>>
  • OCR scans C3, finishes, and saves the text: XXX>>YYY>>ZZZ##newline##XXX>>YYY>>ZZZ##newline##XXX>>YYY>>ZZZ

When OCR has finished everything, the text result is:

XXX>>YYY>>ZZZ
XXX>>YYY>>ZZZ
XXX>>YYY>>ZZZ

Then:

  • Save it as CSV.
  • Open LibreOffice Calc and open the CSV file.
  • In the text import (CSV) dialog, enable the TAB separator option and click OK.
  • Result: a perfect table in CELLS.

Can this be done?

note: ">>"= tabulation
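
A minimal Python sketch of the cell-by-cell procedure above. It assumes the edges of the selection rectangle count as the outer borders, so two inner vertical and two inner horizontal dividers give the 9 cells A1 to C3. The names `grid_cells`, `table_to_tsv` and the `ocr_cell` callback are hypothetical and not part of gImageReader or tesseract; `ocr_cell` stands in for whatever recognizes the text inside one rectangle.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def grid_cells(area: Box, v_lines: List[int], h_lines: List[int]) -> List[List[Box]]:
    """Split a selection rectangle into rows of cells.

    v_lines are x positions of the inner vertical dividers,
    h_lines are y positions of the inner horizontal dividers.
    """
    left, top, right, bottom = area
    xs = [left] + sorted(v_lines) + [right]
    ys = [top] + sorted(h_lines) + [bottom]
    return [
        [(xs[c], ys[r], xs[c + 1], ys[r + 1]) for c in range(len(xs) - 1)]
        for r in range(len(ys) - 1)
    ]


def table_to_tsv(cells: List[List[Box]], ocr_cell: Callable[[Box], str]) -> str:
    """Recognize each cell in reading order; join cells with tabs, rows with newlines."""
    rows = []
    for row in cells:
        # Collapse whitespace so a line break inside a cell cannot break the row.
        texts = [" ".join(ocr_cell(box).split()) for box in row]
        rows.append("\t".join(texts))
    return "\n".join(rows)
```

Writing the returned string to a file and importing it into LibreOffice Calc with the Tab separator enabled should give one spreadsheet cell per table cell.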

manuelsongokuh avatar Jan 13 '16 12:01 manuelsongokuh

Could this work with an awk or sed command that processes the text in the background?

In my opinion, algorithms are not needed.

manuelsongokuh avatar Jan 13 '16 12:01 manuelsongokuh

Well the problem is how to detect the table cell areas from the image, if I understand correctly what you are saying.

manisandro avatar Jan 13 '16 18:01 manisandro

Ah, maybe my English was bad, so let me try again. I don't think automatic (robot) "DETECTION" is needed. I know that would take a long time to code, maybe impossible! But there is an easier way: gImageReader already has a rectangle selection (the area to scan with OCR). That is fine, but when there is a table on the page, YOU (or any user) would manually create a rectangle selection, then add horizontal lines and vertical lines inside it. The table is thus defined by hand, and then the OCR scan starts. That is the goal.

So no automatic "detection", OK? That would take a long time; avoid it.

Adding lines to an "area", instead, should be possible in a short time, OK?

Did you try Okular from KDE? It has a tool named "select table" where you add the lines manually (H, V).

I already did this for my personal finance table: I made area 1 for column 1, area 2 for column 2 and area 3 for column 3, started the scan and saved to TXT. Then I opened LibreOffice Calc and moved the text from below into new columns, because gImageReader did the OCR and the TXT contains:

XXX XXX XXX

YYY YYY YYY

ZZZ ZZZ ZZZ

So I moved YYY and ZZZ to column 2 and column 3 in LibreOffice Calc.
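
That manual move could in principle be scripted. A rough Python sketch, assuming the saved TXT holds one block per column, with blocks separated by blank lines and one cell per line (an assumption about the file layout, not something gImageReader guarantees); `columns_to_rows` is a made-up helper name:

```python
import re
from itertools import zip_longest


def columns_to_rows(text: str) -> str:
    """Turn blank-line-separated column blocks into tab-separated rows."""
    # Each block is one column as it was recognized from one selection area.
    blocks = re.split(r"\n\s*\n", text.strip())
    columns = [[line.strip() for line in block.splitlines()] for block in blocks]
    # Pair up the n-th line of every column; pad short columns with empty cells.
    rows = zip_longest(*columns, fillvalue="")
    return "\n".join("\t".join(row) for row in rows)
```

The result can be pasted into LibreOffice Calc or saved and imported with the Tab separator. If OCR drops or merges a line in one column, the rows after it will be misaligned, so the output still needs a visual check.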

So I think gImageReader could support this manual table area and save a lot of time.

Okular in KDE 4 works well; you can try it and you will see what I mean. Okular's table area is easy for me to use, but it has no OCR :) So gImageReader could do it :+1: :+1: :+1: :+1: :+1:

manuelsongokuh avatar Jan 13 '16 18:01 manuelsongokuh

Aha ok I see what you mean, yes if the user defines the table geometry manually then it is definitely easier.

manisandro avatar Jan 13 '16 19:01 manisandro

Note: I already did my 6 pages of tables with gImageReader and the result is perfect! But moving the text around in LibreOffice takes a long time.

I hope gImageReader will help me with my 200 pages of personal finance records.

manuelsongokuh avatar Jan 13 '16 19:01 manuelsongokuh

Ah, perfect! I will wait for the table geometry areas. GO GO GO GIMAGEREADER!

manuelsongokuh avatar Jan 13 '16 19:01 manuelsongokuh

Sorry, I'm eager to know: when will the small feature for lines in an area (the "manual table geometry", like in Okular) be added? And what is the feature called? (You can change the title of my issue to the correct name of the feature, OK?)

I would like to know in which release or milestone it will arrive.

I love using gImageReader, but I will wait with my 200 pages. Thank you.

manuelsongokuh avatar Jan 17 '16 18:01 manuelsongokuh

Given that gImageReader is purely a spare time project, it really depends on how I'm doing spare time-wise. I'm currently (finally) finishing up an initial implementation for an hOCR editor with PDF generation support, then I'll look at this. Clearly, if you have some knowledge with coding, I always welcome contributions.

manisandro avatar Jan 17 '16 19:01 manisandro

OK. Sorry, I'm not a programmer. If I were a programmer, I could help you. Sorry.

manuelsongokuh avatar Jan 17 '16 19:01 manuelsongokuh

Never too late to learn ;) Anyway, perhaps I'll manage by the end of the month. I'll tell you when there is something to test, before I release a new version.

manisandro avatar Jan 17 '16 19:01 manisandro

I found this phrase: "One interesting tool is the Table selection, which allows you to select a rectangular area, and then divide it into rows and columns. Text selected this way will be available for pasting with rows delimited by newlines and columns delimited by two tab characters." This is about OKULAR.

I can help you find information for coding something similar, OK?
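
For the CSV output requested at the start of this issue, text in the clipboard format quoted above (rows delimited by newlines, columns by two tab characters) could be converted with Python's standard `csv` module. This is only a sketch; `clipboard_table_to_csv` is a hypothetical name, and the two-tab separator is taken from the quoted sentence rather than verified against Okular itself.

```python
import csv
import io


def clipboard_table_to_csv(text: str, column_sep: str = "\t\t") -> str:
    """Convert newline/tab-delimited table text into CSV."""
    # Skip blank lines; split every remaining line into its cells.
    rows = [line.split(column_sep) for line in text.splitlines() if line.strip()]
    buf = io.StringIO()
    # The csv writer quotes cells that contain commas or quotes.
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

Passing `column_sep="\t"` handles text that uses a single tab between columns, such as the ">>" scheme described earlier in this thread.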

manuelsongokuh avatar Jan 17 '16 19:01 manuelsongokuh

I'm familiar with how okular works, so it is pretty much a matter of just coding the implementation.

manisandro avatar Jan 17 '16 19:01 manisandro

I don't know if these are helpful for you: http://tex.stackexchange.com/questions/279846/split-the-selection-area-of-two-columns-tabular-or-minipage-or-whatever-works

https://github.com/KDE/okular/search?utf8=%E2%9C%93&q=table

http://stackoverflow.com/questions/488089/extracting-tables-from-pdf-files-programmatically

manuelsongokuh avatar Jan 17 '16 19:01 manuelsongokuh

I've also used tabula in some cases. There is an effort to combine tabula with tesseract to do exactly this.

I'll be following these two repositories fairly closely from here on out!

jakehockey10 avatar Nov 25 '16 21:11 jakehockey10

This feature sounds fun, and I want to implement it. But tables come in many different forms, so how do we detect the column lines and row lines of different tables?

napasa avatar Jan 08 '18 15:01 napasa

Okular has a table tool which can serve as inspiration (i.e. it requires the user to mark the row and column boundaries). It should also be possible to autodetect them with a smart algorithm. But the main problem is what to do with the result IMO. It doesn't fit into the workflow currently.

manisandro avatar Jan 08 '18 15:01 manisandro

Leptonica has some new table detection features - please see https://github.com/DanBloomberg/leptonica/search?q=table&type=Commits&utf8=%E2%9C%93

Shreeshrii avatar Jan 11 '18 11:01 Shreeshrii

"Currently working on hOCR editor with pdf export..." [4 years ago]

Is that project finished?

I wonder if the new hOCR features in Tesseract 5 can help in creating tables?

BTW if we can save the OCRed file as HTML or RTF, we can open it in LibreOffice and convert it into a table. If the original table has merged cells or nested tables, we can recreate those in LibreOffice.

raindropsfromsky avatar Mar 28 '20 04:03 raindropsfromsky

"I wonder if the new hOCR features in Tesseract 5 can help in creating tables?"

Do you have a reference you can point to?

"BTW if we can save the OCRed file as HTML or RTF, we can open it in LibreOffice and convert it into a table."

The hOCR format is actually HTML, which you can save as such. You can also export to ODT.
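
Because hOCR is HTML, the recognized words and their positions can be read back with an ordinary HTML parser. A minimal Python sketch, assuming words are `span` elements with class `ocrx_word` whose `title` attribute carries `bbox x0 y0 x1 y1` (the layout tesseract's hOCR output uses); `hocr_words` is a hypothetical helper, not a gImageReader function.

```python
import re
from html.parser import HTMLParser
from typing import List, Tuple

Word = Tuple[str, Tuple[int, int, int, int]]  # (text, (x0, y0, x1, y1))


class _HocrWordParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.words: List[Word] = []
        self._bbox = None  # bounding box of the word currently being read
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "span" and "ocrx_word" in (a.get("class") or "").split():
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", a.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def hocr_words(hocr: str) -> List[Word]:
    """Return every recognized word together with its bounding box."""
    parser = _HocrWordParser()
    parser.feed(hocr)
    return parser.words
```

With the bounding boxes available, each word could be assigned to a table cell by checking which user-defined cell rectangle contains the center of its box.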

manisandro avatar Mar 28 '20 12:03 manisandro

See also https://github.com/tesseract-ocr/tesseract/issues/1714

manisandro avatar Apr 14 '20 12:04 manisandro

gImageReader is indeed amazing, but this issue with tables is essential for most non-trivial real-world OCR use cases.

Tesseract can't and won't be able to handle anything but very basic tables anytime soon, as even the best table algorithms can only do basic tables. Frankly, I don't even see an AI being able to handle all types of tables without user interaction (not for many years, still).

Therefore, the GUI solution is essential.

I used ABBYY FineReader for many years and it had both a table-recognition algorithm (good for about 75% of tables) and the ability to define a table selection box, add horizontal and vertical lines to it, merge and split cells, and generate the whole table in a Word format for output. It was very good.

That is a very advanced solution, and more of a wish list for gImageReader at this point.

But, Manisandro, here is how I would approach this in gImageReader:

Phase I

  1. You already have selection boxes, add the ability to place horizontal and vertical lines in them; this defines cells.
  2. Submit each cell to Tesseract for recognition as an individual box, and follow a right-left, top-bottom order.
  3. Present the results in plain text in the order it was read, one cell per paragraph.
  4. Roll out Phase I, as this will solve 80% of the table problems for everyone out there.

Phase II

  1. Add the ability to merge and split cells, keep everything else the same.
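
At the geometry level, merging and splitting can be quite small operations if each cell is stored as a rectangle. A Python sketch under that assumption; `merge_boxes` and `split_box` are hypothetical names, and the rectangle representation is a guess at how cells might be modelled, not gImageReader's actual data structure.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def merge_boxes(a: Box, b: Box) -> Box:
    """Merge two cells into the smallest rectangle that covers both."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def split_box(box: Box, x: int) -> Tuple[Box, Box]:
    """Split a cell into a left and a right part at the vertical line x."""
    left, top, right, bottom = box
    if not left < x < right:
        raise ValueError("split position must lie strictly inside the cell")
    return (left, top, x, bottom), (x, top, right, bottom)
```

A horizontal split would follow the same pattern with `top` and `bottom`. The merged or split rectangles can then be sent to the OCR engine exactly like the unmodified cells of Phase I.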

Phase III

  1. Add the ability to present the output in a table format.

Phase IV

  1. Add an algorithm for automatic table layout detection within the table selection box. Or, by this time, Tesseract will have this functionality available in the API, and that will make it easier. Note that even then the previous Phases will STILL be needed.
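
As an illustration of what the simplest form of such detection could look like, here is a naive projection-profile sketch in Python: a pixel row or column that is almost entirely dark is treated as a ruling line. This is not what Tesseract or Leptonica do; it only handles tables with drawn lines in a deskewed, binarized image. The name `ruling_positions`, the 0/1 bitmap input and the `min_fill` threshold are assumptions made for the example.

```python
from typing import List, Sequence


def ruling_positions(bitmap: Sequence[Sequence[int]], axis: str,
                     min_fill: float = 0.9) -> List[int]:
    """Find ruling lines in a binary image (1 = dark pixel).

    axis="h" returns the y positions of horizontal lines,
    axis="v" returns the x positions of vertical lines.
    """
    # For vertical lines, transpose so that each "line" is one pixel column.
    lines = bitmap if axis == "h" else list(zip(*bitmap))
    hits = [i for i, line in enumerate(lines)
            if sum(line) >= min_fill * len(line)]
    # A thick ruling covers several adjacent indices; keep one per run.
    runs: List[List[int]] = []
    for i in hits:
        if runs and i == runs[-1][-1] + 1:
            runs[-1].append(i)
        else:
            runs.append([i])
    return [run[len(run) // 2] for run in runs]
```

The positions found this way could pre-populate the horizontal and vertical lines inside the selection box, leaving the user to correct them by hand, which is why the earlier phases would still be needed.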

Keep up the awesome work, and thank you!!!

ebaldino avatar Dec 16 '23 12:12 ebaldino