tocPDF

For a 2-column TOC that is scanned/OCR'd: suggestion for the best tool to extract the TOC text?

Open stillhope opened this issue 1 year ago • 0 comments

Hello, can you suggest the best tool to extract the TOC text from a 2-column TOC (the PDF is scanned and OCR'd)?

The problem with OCR space is that it does not read the text column by column (first column, then second column). Instead it reads left to right across the whole page, so text from the two columns gets interleaved and ends up in the wrong place.

For example, the extract result from OCR space is below (Chapter Six is in column 2 of the TOC, but the tool has read it on line 1):

Contents Number Chapter Six: Units..............:.......48
Length, mass, capacity
Chapter One: Types Of and time.... ....
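To illustrate the reading order I am after, here is a minimal sketch. It assumes the OCR tool can export each word with its position; the `(text, x_left, y_top)` tuple format, the `read_two_columns` name, and the idea of splitting the page at its horizontal midpoint are my own assumptions, not something OCR space or Tabular is known to provide.

```python
from typing import List, Tuple

# Hypothetical input format: (text, x_left, y_top) for each OCR'd word.
Word = Tuple[str, float, float]


def group_lines(column: List[Word], line_tol: float) -> List[str]:
    """Group the words of one column into text lines, top to bottom."""
    rows = []  # each entry: [reference_y, [(x, text), ...]]
    for text, x, y in sorted(column, key=lambda w: w[2]):
        if rows and abs(y - rows[-1][0]) <= line_tol:
            # Close enough vertically: same line as the previous word.
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Within each line, order the words left to right.
    return [" ".join(t for _, t in sorted(items)) for _, items in rows]


def read_two_columns(words: List[Word], page_width: float,
                     line_tol: float = 5.0) -> List[str]:
    """Return the lines of the left column first, then the right column."""
    mid = page_width / 2  # assumption: the column gutter is at mid-page
    left = [w for w in words if w[1] < mid]
    right = [w for w in words if w[1] >= mid]
    return group_lines(left, line_tol) + group_lines(right, line_tol)
```

With word positions like these, "Chapter Six" would come out after all of column 1 rather than on line 1. A real page may need a measured gutter position instead of the midpoint.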

The problem with Tabular is that I could not find any 2-column TOC template. I tried to create my own template as a new user, and it did a very average job (e.g. it did not recognise the end of a sentence and kept the leading ..... before the page number). I also could not find any automatic scripts for the Sublime Text editor to handle the typical TOC text-editing issues.

Nuntber, Chapter One: Types of, number ........................................... 2, Squares and square roots .................,2 Cubes and cube roots .......................,2 Multiples .......................................,4 Prime factorisation ..........................,6 Chapter Two: Using numbers .....1 0,

Tabular is better than OCRspace in that the text is in the correct order, but it still takes a lot of manipulation in Sublime Text to get the "TOC text file" into the layout required to auto-create TOC bookmarks in the PDF (i.e. using one of the apps, pdftk or jpdfbookmarks).
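The kind of clean-up I currently do by hand could probably be scripted. Below is a rough sketch, assuming one TOC entry per line with the page number at the end. The `parse_toc_line` name and the regular expression are my own guesses at what would work, and the result is just a `(title, page)` pair that would still need converting into whatever format pdftk or jpdfbookmarks expects.

```python
import re
from typing import Optional, Tuple

# Title, then a run of at least three leader characters (dots, spaces,
# stray colons or commas from OCR), then the page number at the end.
TOC_LINE = re.compile(r"^(?P<title>.+?)\s*[.\s:,]{3,}\s*(?P<page>\d+)\s*$")


def parse_toc_line(line: str) -> Optional[Tuple[str, int]]:
    """Split one TOC line into (title, page); return None if it does not fit."""
    match = TOC_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("title").strip(), int(match.group("page"))


if __name__ == "__main__":
    samples = [
        "Chapter Six: Units..............:.......48",
        "Squares and square roots .................,2",
        "Length, mass, capacity",
    ]
    for sample in samples:
        print(parse_toc_line(sample))
```

Lines without a trailing page number (such as wrapped sub-headings) come back as `None` and would still need manual attention, as would page numbers that OCR has split, like `1 0`.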

Tabular currently offers no way to ask questions or get help. On GitHub, its issue tab is not showing.

stillhope, Jan 10 '23 21:01