tesserocr
how to get .tsv result
With the tesseract command line, we can get a .pdf, .hocr, or .tsv file in addition to the .txt result by passing in a config file. Can we get these results in tesserocr?
You should be able to use the same APIs that the tesseract CLI uses. I'm not sure what you want exactly, but you can look at the ReadConfigFile, SetVariable, GetHOCRText, GetTSVText, and ProcessPages methods.
Thanks for your reply. What I want is for tesserocr to output and save a .tsv file. I can do this by running tesseract from the terminal with the tsv configuration file. The result is different from GetTSVText. Using ProcessPages I did get a .pdf file, but it seems it couldn't render a .tsv file. Anyway, thanks for your help.
> The result is different from GetTSVText
That's true, but the difference is small: it's just the document "preamble" that's missing:
"level\tpage_num\tblock_num\tpar_num\tline_num\tword_"
"num\tleft\ttop\twidth\theight\tconf\ttext\n"
You should be able to get the typical CLI behaviour of rendering to output files by using ProcessPage / ProcessPages after calling ReadConfigFile('tsv') and ReadConfigFile('pdf') etc. (These refer to filenames under /usr/local/share/tessdata/configs or similar.)
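The approach described above could look roughly like the sketch below. It assumes that ProcessPages takes the output base name followed by the input file name, and that the installed tesserocr build enables a renderer for each config that is read (the original poster reported getting a .pdf but no .tsv, so this may depend on the version). The helper names `expected_outputs` and `render_with_configs` are hypothetical, and the name-to-extension mapping is only meant for config names such as tsv, pdf, hocr, and txt.

```python
def expected_outputs(outputbase, configs):
    """List the files the renderers would be expected to write (hypothetical helper)."""
    return [f"{outputbase}.{name}" for name in configs]


def render_with_configs(image_path, outputbase, configs=("tsv", "pdf"), lang="eng"):
    """Run OCR and let Tesseract's renderers write the output files (hypothetical helper)."""
    # Imported lazily so expected_outputs() works without tesserocr installed.
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI(lang=lang) as api:
        for name in configs:
            # Bare names are looked up in the tessdata "configs" directory.
            api.ReadConfigFile(name)
        # Assumed argument order: output base name first, then the input image.
        ok = api.ProcessPages(outputbase, image_path)
    return ok, expected_outputs(outputbase, configs)
```

Calling `render_with_configs("page.png", "out")` would then be expected to produce out.tsv and out.pdf if both renderers are available; checking which files actually exist afterwards is a reasonable way to find out what the installed version supports.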
from tesserocr import PyTessBaseAPI, PSM, OEM

# automatic page segmentation, LSTM engine, English language
with PyTessBaseAPI(psm=PSM.AUTO, oem=OEM.LSTM_ONLY, lang="eng") as api:
    api.SetImageFile(image_name)  # image_name: path of the image file to run OCR on
    api.Recognize()
    print(api.GetTSVText(0))  # TSV rows for the first page (page number 0)
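To save a file that resembles the CLI's .tsv output instead of printing it, one option is to prepend the header row quoted earlier in this thread to whatever GetTSVText returns. This is a minimal sketch, not the library's own method: `TSV_HEADER`, `with_tsv_header`, and `save_tsv` are hypothetical names, and the header text is copied from the string literal quoted above.

```python
# Header row that the CLI writes before the data rows
# (copied from the string literal quoted earlier in this thread).
TSV_HEADER = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext\n"
)


def with_tsv_header(tsv_body):
    """Return tsv_body with the column header row prepended (hypothetical helper)."""
    return TSV_HEADER + tsv_body


def save_tsv(image_path, tsv_path, lang="eng"):
    """OCR image_path and write a CLI-style .tsv file (hypothetical helper)."""
    # Imported lazily so with_tsv_header() works without tesserocr installed.
    from tesserocr import PyTessBaseAPI, PSM

    with PyTessBaseAPI(psm=PSM.AUTO, lang=lang) as api:
        api.SetImageFile(image_path)
        api.Recognize()
        body = api.GetTSVText(0)  # data rows only, no header row
    with open(tsv_path, "w", encoding="utf-8") as f:
        f.write(with_tsv_header(body))
```

For example, `save_tsv("page.png", "page.tsv")` would write the header followed by the rows for the first page; both file names are placeholders.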