peculiarities when running text2image on windows
(this is more of a comment than an issue but more issues can follow and the discussion might be useful; nevertheless, it might be closed after the PR for 1. )
1. At the moment, text2image expects the fc backend (e.g., https://github.com/tesseract-ocr/tesseract/blob/ba2ea39caaa791b5e5f092953057cb8ffb094a82/training/pango_font_info.cpp#L356), but if Pango is compiled with win32 support, you get the win32 font map first:
#if defined(HAVE_CAIRO_WIN32)
  if (!backend || 0 == strcmp (backend, "win32"))
    return g_object_new (PANGO_TYPE_CAIRO_WIN32_FONT_MAP, NULL);
#endif
#if defined(HAVE_CAIRO_FREETYPE)
  if (!backend || 0 == strcmp (backend, "fc")
      || 0 == strcmp (backend, "fontconfig"))
    return g_object_new (PANGO_TYPE_CAIRO_FC_FONT_MAP, NULL);
#endif
and nasty crashes follow because of the wrong reinterpret cast.
Fast Solution: specify the fc backend.
Solution: a simple patch will follow that fixes the behaviour for, at least, the most important functionality.
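To make the ordering problem in point 1 concrete, here is a minimal C++ model of the selection logic quoted above. It is not Pango code: SelectFontMap and the two boolean parameters (standing in for HAVE_CAIRO_WIN32 and HAVE_CAIRO_FREETYPE) are made up for illustration, and the "none" result only marks where the quoted snippet ends.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical model of the quoted snippet, not Pango's API.
// have_win32 / have_fc stand in for HAVE_CAIRO_WIN32 / HAVE_CAIRO_FREETYPE.
std::string SelectFontMap(const char* backend, bool have_win32, bool have_fc) {
  // The win32 branch comes first, so an unset backend selects it.
  if (have_win32 && (!backend || 0 == std::strcmp(backend, "win32")))
    return "win32";
  if (have_fc && (!backend || 0 == std::strcmp(backend, "fc") ||
                  0 == std::strcmp(backend, "fontconfig")))
    return "fc";
  return "none";  // whatever follows the quoted snippet is not modelled here
}

// The scenario from point 1: both backends compiled in.
void CheckExamples() {
  assert(SelectFontMap(nullptr, true, true) == "win32");  // unset: win32 wins
  assert(SelectFontMap("fc", true, true) == "fc");        // explicit fc: fc map
}
```

With both backends available, only an explicit "fc" (or "fontconfig") yields the font map type that text2image assumes, which is why the fast solution works.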
2. If fontconfig is linked as a DLL, the putenv call does not get propagated to fontconfig:
https://github.com/tesseract-ocr/tesseract/blob/ba2ea39caaa791b5e5f092953057cb8ffb094a82/training/pango_font_info.cpp#L151
Solution: set it as an environment variable before launching text2image.
3. You cannot use a bare drive path (e.g., c:) in FONTCONFIG_PATH, because fontconfig strips slashes from the path (FcStrCanonAbsoluteFilename) and then calls
GetFullPathNameW (dirname, 0, NULL, NULL)
without the slash. That function, interestingly, behaves like this:
"If a file name begins with only a disk designator but not the backslash after the colon, it is interpreted as a relative path to the current directory on the drive with the specified letter."
Solution: specify a sane directory.
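A rough way to catch the situation in point 3 before fontconfig sees the value is sketched below. This is a simplified model, not fontconfig's code: StripTrailingSlashes, IsDriveRelative and IsRiskyFontconfigPath are hypothetical helpers, and the stripping step only approximates what is described above for FcStrCanonAbsoluteFilename.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: drop trailing '/' and '\\' characters, approximating
// the slash stripping described in point 3.
std::string StripTrailingSlashes(std::string path) {
  while (!path.empty() && (path.back() == '/' || path.back() == '\\'))
    path.pop_back();
  return path;
}

// True for a drive designator that is not followed by a slash,
// e.g. "c:" or "c:fonts". Windows treats these as relative to the
// current directory on that drive.
bool IsDriveRelative(const std::string& path) {
  if (path.size() < 2) return false;
  if (!std::isalpha(static_cast<unsigned char>(path[0])) || path[1] != ':')
    return false;
  return path.size() == 2 || (path[2] != '\\' && path[2] != '/');
}

// A FONTCONFIG_PATH value is risky if it is drive-relative once the
// trailing slashes are gone.
bool IsRiskyFontconfigPath(const std::string& path) {
  return IsDriveRelative(StripTrailingSlashes(path));
}

void CheckExamples() {
  assert(IsRiskyFontconfigPath("c:\\"));         // becomes "c:" after stripping
  assert(!IsRiskyFontconfigPath("c:\\fonts\\"));  // stays absolute
}
```

A drive root such as c:\ is flagged because it collapses to c:, while c:\fonts\ stays absolute after stripping.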
None of these issues has been solved.
At least the first one probably also affects Tesseract built with MinGW and on Mac.
Fast Solution: specify fc backend
https://github.com/GNOME/pango/blob/master/pango/pangocairo-fontmap.c#L48
Something like this should be put in text2image.cpp:
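#ifdef _WIN32
  // The Windows CRT has no setenv(); _putenv() takes a "NAME=value" string.
  _putenv("PANGOCAIRO_BACKEND=fc");
#else
  // Third argument non-zero: overwrite any existing value.
  setenv("PANGOCAIRO_BACKEND", "fc", 1);
#endif  // _WIN32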
This should be tested on Mac and MinGW before committing the code. The issue does not affect Linux.
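For testing on the affected platforms, the suggestion can be wrapped in a small function that also confirms the value reads back within the process. This is only a sketch: SetEnvVar and ForceFcBackend are hypothetical names, not existing Tesseract functions, and it assumes the call happens before Pango creates its default font map.

```cpp
#include <cassert>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: set an environment variable for the current process
// and report whether it reads back with the requested value.
bool SetEnvVar(const char* name, const char* value) {
  assert(name != nullptr && value != nullptr);
#ifdef _WIN32
  _putenv_s(name, value);   // Windows CRT has no setenv()
#else
  setenv(name, value, 1);   // 1 = overwrite an existing value
#endif
  const char* readback = getenv(name);
  return readback != nullptr && strcmp(readback, value) == 0;
}

// Hypothetical wrapper for the backend override discussed in this thread.
bool ForceFcBackend() {
  return SetEnvVar("PANGOCAIRO_BACKEND", "fc");
}
```

A true return value only shows that this process's own C runtime sees the variable. As point 2 of the original report notes, a library linked as a DLL may not see it.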
I think this issue should be reopened.
AFAIK vidiecan is using VS. Is there any report from mingw users?
He fixed number (1) in his list in one place in the code. That piece of code did cause a crash on Windows+VS, MinGW(64) and Mac. There is another similar piece of code that will probably cause a crash in some situations on all these platforms. I suggested a solution above, but it is useless to test it on Linux.
Here is the problematic line: https://github.com/tesseract-ocr/tesseract/blob/182ca5bc1e/training/pango_font_info.cpp#L367
You need to run text2image with the only_extract_font_properties flag to trigger the function in which this code lives.
The dotted_circle changes in #381 caused problems (in Linux at least). See: https://github.com/tesseract-ocr/tesseract/blob/5bb97f966885/training/pango_font_info.cpp#L438
@vidiecan: Are points 2. and 3. still valid? If yes, do you have a PR for them?
The relevant code was rewritten in Tesseract 5.0.
@stweil,
Do you know if all the issues that were mentioned by the OP were solved?
No, I don't know that and would have to run tests first. @vidiecan, did you test with the latest installer for Windows?