how to get .tsv result

Open aspenlin opened this issue 6 years ago • 4 comments

With the tesseract command line, we can get a .pdf, .hocr, or .tsv file in addition to the .txt result by passing in a config file. Can we get these results in tesserocr?

aspenlin avatar Jun 25 '19 20:06 aspenlin

You should be able to use the same APIs that the tesseract cli uses. I'm not sure what you want exactly but you can look at the ReadConfigFile, SetVariable, GetHOCRText, GetTSVText, ProcessPages methods.

sirfz avatar Jun 26 '19 14:06 sirfz

Thanks for your reply. What I want is for tesserocr to output and save a .tsv file. This can be done by running tesseract from the terminal with the tsv configuration file. The result is different from GetTSVText. Using ProcessPages I did get a .pdf file, but it seems it couldn't render a .tsv file. Anyway, thanks for your help.

aspenlin avatar Jun 26 '19 14:06 aspenlin

The result is different from GetTSVText

That's true, but the difference is small: it's just the document "preamble" that's missing:

"level\tpage_num\tblock_num\tpar_num\tline_num\tword_"
"num\tleft\ttop\twidth\theight\tconf\ttext\n"

You should be able to get the typical CLI behaviour of rendering to output files by using ProcessPage / ProcessPages after calling ReadConfigFile('tsv') and ReadConfigFile('pdf') etc. (These refer to filenames under /usr/local/share/tessdata/configs or similar.)

bertsky avatar Dec 18 '19 02:12 bertsky

from tesserocr import PyTessBaseAPI, PSM, OEM

# automatic page segmentation; LSTM model; English language
with PyTessBaseAPI(psm=PSM.AUTO, oem=OEM.LSTM_ONLY, lang="eng") as api:
    api.SetImageFile(image_name)  # image_name is the path of the image file to run OCR on
    api.Recognize()

    print(api.GetTSVText(0))  # 0 = first page
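
To end up with a saved file that looks like the CLI's .tsv, one option is to prepend the header row quoted earlier in this thread and write the result to disk yourself. This is only a minimal sketch: `with_tsv_header` and `save_tsv` are hypothetical helper names, not part of tesserocr, and it assumes the missing header row really is the only difference from the CLI output.

```python
from pathlib import Path

# Header row that the tesseract CLI puts at the top of a .tsv file and that
# GetTSVText leaves out (copied from the comment earlier in this thread).
TSV_HEADER = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
)


def with_tsv_header(tsv_body: str) -> str:
    """Prepend the CLI-style header row to the text returned by GetTSVText."""
    return TSV_HEADER + tsv_body


def save_tsv(image_name: str, out_path: str, lang: str = "eng") -> Path:
    """Hypothetical helper: OCR one image and write a CLI-like .tsv file."""
    # Imported lazily so the header helper above works without tesserocr.
    from tesserocr import PyTessBaseAPI, PSM

    with PyTessBaseAPI(psm=PSM.AUTO, lang=lang) as api:
        api.SetImageFile(image_name)
        api.Recognize()
        body = api.GetTSVText(0)  # 0 = first page

    path = Path(out_path)
    path.write_text(with_tsv_header(body), encoding="utf-8")
    return path
```

Usage would be something like `save_tsv("page.png", "page.tsv")`, where both file names are placeholders.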

mineshmathew avatar Jan 01 '21 03:01 mineshmathew