tesseract
Detect text rotation without running recognition
As noted in the documentation, Tesseract performs poorly when the page is at an angle (not a multiple of 90 degrees). This is not a problem from an accuracy standpoint, because Tesseract accurately reports the angle of text lines, so my existing pipeline rotates and re-runs recognition on any image where the angle is significant. However, this is computationally inefficient, as there does not appear to be any way to get the page angle without also running recognition (even though estimating the page angle/gradient is one of the first things calculated).
Therefore, it would be of significant benefit to be able to get the page angle without running the entire recognition process. I'll work on a build that does this myself. My initial thought is to add a config option that tells Tesseract to report the page angle and quit early (before recognition) if the median line angle is above a user-defined threshold, but let me know if others have thoughts on implementation.
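To make the proposed early-exit rule concrete, here is a minimal standalone sketch of the decision itself. It is not Tesseract code: the function names and the idea of passing the per-line angles in as a plain vector of degrees are assumptions made purely for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Tesseract): median of per-line angles,
// in degrees. Returns 0.0 when no lines were found.
double medianLineAngle(std::vector<double> angles) {
    if (angles.empty()) return 0.0;
    std::sort(angles.begin(), angles.end());
    const std::size_t n = angles.size();
    // Odd count: middle element. Even count: mean of the two middle elements.
    return (n % 2 == 1) ? angles[n / 2]
                        : 0.5 * (angles[n / 2 - 1] + angles[n / 2]);
}

// Hypothetical early-exit check: true when the magnitude of the median line
// angle exceeds the user-defined threshold, i.e. the page should be rotated
// before recognition is attempted.
bool shouldStopBeforeRecognition(const std::vector<double>& angles,
                                 double thresholdDeg) {
    return std::fabs(medianLineAngle(angles)) > thresholdDeg;
}
```

Using the median rather than the mean keeps a few badly fitted lines from dominating the estimate, which seems to match the "median line angle" wording in the proposal.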
For such image preprocessing, I would suggest having a look at the Leptonica example programs/functions flipdetect_reg, skewtest, skew_reg, and maybe dewarptest2...
Of course there are limitations (see e.g. issue 622), but they are fast and reliable for most of my cases...
IMHO such preprocessing should be done outside of Tesseract.
Thanks for your response, I will review the Leptonica scripts linked before deciding how to implement.
I found a much, much faster solution to detect page rotation. Call SetImage followed by DetectOrientationScript and then call
Pix *rotated = pixRotateOrth(pix, ((360 - degree) / 90) % 4);  // % 4 keeps quads in 0..3 when degree is 0
However, there is currently a bug that causes this to fail randomly, so you need my short patch from https://github.com/tesseract-ocr/tesseract/issues/4062
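As a hedged illustration of the degree-to-quads arithmetic in the call above, the sketch below isolates the mapping in a standalone function. It assumes the detected orientation is a clockwise angle in {0, 90, 180, 270} and that pixRotateOrth takes a count of clockwise 90-degree turns in the range 0..3. The helper name is hypothetical.

```cpp
// Hypothetical helper: convert a detected clockwise orientation (degrees)
// into the number of clockwise quarter turns needed to bring the page upright.
int quadsToUpright(int orientDeg) {
    // Normalize into [0, 360) so negative or >= 360 inputs are handled too.
    const int normalized = ((orientDeg % 360) + 360) % 360;
    // Undo a clockwise rotation of `normalized` by rotating the rest of the
    // way around; the final % 4 maps an upright page (0 degrees) to 0 quads
    // rather than 4.
    return ((360 - normalized) / 90) % 4;
}
```

The result can then be passed as the second argument of pixRotateOrth. An upright page yields 0, in which case the rotation can be skipped altogether.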
It is here: https://github.com/DanBloomberg/leptonica/blob/0ffbc6822c23725b5b9f6876e2620a22ba3689f4/src/rotateorth.c#L64
That is the API to rotate an image, but not the API to detect if it is rotated. Tesseract docs and some StackOverflow comments recommend Recognize(), but that is extremely slow. On a sample TIFF I used, it took 0.9 seconds for DetectOrientationScript vs 2.1 seconds for Recognize, when both were followed by a 90-degree rotation and another Recognize to extract text.
@todd-richmond, you are talking about orientation detection: 0 / 90 / 180 / 270 degrees.
@Balearica is talking about a page with some parts that are skewed
Never mind. I missed the "not" before "a multiple of 90 degrees" when reading. De-skewing is much more challenging, so we haven't bothered dealing with that for now.
@Balearica,
Did you try using AnalyseLayout()?
https://github.com/tesseract-ocr/tesseract/blob/bf7c134ba6958f2efdaace2fbeba31cad91394ce/include/tesseract/baseapi.h#L433-L449
@amitdo I did not end up implementing it this way, but I do believe that running AnalyseLayout and then using the lines to re-calculate the average gradient would be another way to go about this.
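A rough sketch of what "using the lines to re-calculate the average gradient" could look like is shown below. It assumes the baseline endpoints of each text line have already been collected (for example through the iterator returned by AnalyseLayout) into a plain struct. The struct, the function name, and the length-weighted averaging are illustrative assumptions, not what Tesseract does internally.

```cpp
#include <cmath>
#include <vector>

// Hypothetical record: the two endpoints of one text-line baseline, in image
// pixel coordinates (y grows downward).
struct Baseline {
    double x1, y1, x2, y2;
};

// Length-weighted average baseline angle in degrees. Summing the direction
// vectors lets long lines contribute more than short ones. In image
// coordinates a positive result means the text slopes downward to the right.
double averageBaselineAngleDeg(const std::vector<Baseline>& lines) {
    double sumDx = 0.0, sumDy = 0.0;
    for (const Baseline& b : lines) {
        double dx = b.x2 - b.x1;
        double dy = b.y2 - b.y1;
        if (dx < 0) {  // orient every baseline left to right
            dx = -dx;
            dy = -dy;
        }
        sumDx += dx;
        sumDy += dy;
    }
    if (sumDx == 0.0 && sumDy == 0.0) return 0.0;  // no usable lines
    const double kPi = std::acos(-1.0);
    return std::atan2(sumDy, sumDx) * 180.0 / kPi;
}
```

This repeats work Tesseract has already done during layout analysis, which is the redundancy the branch mentioned next is meant to avoid.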
I ended up creating a branch that allows for retrieving the number Tesseract already calculates, which I pushed to #4070. I think this is the most direct approach, and the only approach that does not involve redundant calculations.