pyocr libtesseract: needs stress-testing

Someone has been reporting crashes of Paperwork when running the OCR. They are using Tesseract 3.04.01, so there may be something wrong with the libtesseract binding.

(Note: for now, the preference order has been changed so that pyocr uses tesseract-sh whenever possible.)

Dec 06 '16 09:12 jflesch

Getting occasional segfaults when using the pyocr.libtesseract tool. Can't pinpoint an exact repeatable cause. Will update if a pattern that triggers the segfault is found.
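Editorial sketch: one way to hunt for a pattern behind intermittent segfaults is to run each OCR call in a child interpreter, so that a native crash kills only the child and the parent can log which run failed. The helper names `run_isolated` and `stress` below are hypothetical and not part of pyocr. The signal detection relies on POSIX behaviour (a negative return code) and would need adapting on Windows.

```python
import signal
import subprocess
import sys


def run_isolated(snippet, timeout=120):
    """Run a Python snippet in a child interpreter and report how it ended."""
    proc = subprocess.run(
        # -X faulthandler makes the child dump a Python traceback on a fatal signal.
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # On POSIX, a negative return code means the child was killed by that signal.
    crashed = proc.returncode < 0
    sig = signal.Signals(-proc.returncode).name if crashed else None
    return crashed, sig, proc.stderr


def stress(snippet, runs=50):
    """Repeat the snippet and collect (run index, signal name, stderr) per crash."""
    failures = []
    for i in range(runs):
        crashed, sig, stderr = run_isolated(snippet)
        if crashed:
            failures.append((i, sig, stderr))
    return failures
```

The snippet passed to `stress` would be whatever reproduces the real workload, for example a few lines that open an image with Pillow and call `pyocr.libtesseract.image_to_string()` on it. The input file and language used there are up to the tester.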

The other segfault occurs when there is no language data. This one is consistent.

[screenshot from 2017-03-22 02-05-36]

Mar 22 '17 07:03 ghost

If you find a pattern, that would be awesome :-)

Noted for the no-language crash. I'll have a look ASAP (hopefully this weekend).

Mar 22 '17 09:03 jflesch

BTW, can you tell me which version of Tesseract you are using, please?

Mar 22 '17 10:03 jflesch

no-language crash:

The issue appears to come from libtesseract itself and has been reported.
Workaround implemented: f2324022deaf7a526e5f6f12cc5d6bf0503944ea
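
Editorial sketch: the commit above is not reproduced here, and the following is not necessarily what it does. It only illustrates one caller-side guard against the no-language crash: check that a `<lang>.traineddata` file exists before handing control to native code. The function names are hypothetical, and the location of the tessdata directory is left to the caller as an assumption, since it varies between distributions and Tesseract versions.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return language codes that have a .traineddata file in tessdata_dir."""
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        return []
    return sorted(p.stem for p in directory.glob("*.traineddata"))


def require_language(lang, tessdata_dir):
    """Raise a Python-level error instead of calling native code without data."""
    available = installed_languages(tessdata_dir)
    if lang not in available:
        raise RuntimeError(
            "no Tesseract data for %r in %s (found: %s)"
            % (lang, tessdata_dir, ", ".join(available) or "none")
        )
    return lang
```

This handles a single language code only. A combined request such as "eng+fra" would have to be split and checked code by code.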

Mar 22 '17 11:03 jflesch

Tesseract version is 3.04.01, from Ubuntu's 3.04.01-4build1 package.

Thanks for the fix.

We lowered Mayan EDMS (http://www.mayan-edms.com) memory footprint by switching to pyocr's libtesseract, thanks for that too :)

Mar 22 '17 19:03 ghost

You're welcome :)

Mar 22 '17 21:03 jflesch

Hm, maybe the crashes were due to a hack: TessBaseAPIDetectOS() was actually a C++ function. I was using ctypes to access it, and let's just say ctypes is not designed for C++, so it is/was a bit hacky. It may have been the cause of crashes on some systems. Tesseract 3.05.00 introduced a replacement function, TessBaseAPIDetectOrientationScript(), that is pure C. @aszlig added support for this new function.
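
Editorial sketch: with ctypes, looking up a symbol that a shared library does not export raises AttributeError, so a binding can probe for the pure C entry point and skip orientation detection when it is missing. This is only an illustration of that probing pattern. It is not taken from pyocr or from @aszlig's patch, and the helper names are hypothetical.

```python
import ctypes
import ctypes.util


def first_available(lib, names):
    """Return (name, function) for the first symbol in names that lib exports."""
    for name in names:
        # ctypes raises AttributeError for missing symbols, so getattr's
        # default value turns a missing symbol into None.
        func = getattr(lib, name, None)
        if func is not None:
            return name, func
    return None, None


def load_orientation_function():
    """Try to locate libtesseract and its C orientation-detection entry point."""
    path = ctypes.util.find_library("tesseract")
    if path is None:
        return None, None  # libtesseract is not installed or not discoverable
    lib = ctypes.CDLL(path)
    return first_available(lib, ["TessBaseAPIDetectOrientationScript"])
```

A caller that gets `(None, None)` back could disable orientation detection or fall back to the tesseract-sh tool. Before calling the returned function, its `argtypes` and `restype` would still need to be set to match the declaration in Tesseract's `capi.h`.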

I think I will try to switch libtesseract back as default once Tesseract 4 is out.

May 13 '17 15:05 jflesch