Peculiarities when running text2image on Windows

Open vidiecan opened this issue 8 years ago • 9 comments

(this is more of a comment than an issue but more issues can follow and the discussion might be useful; nevertheless, it might be closed after the PR for 1. )

  1. At the moment, text2image expects the fc backend, e.g.: https://github.com/tesseract-ocr/tesseract/blob/ba2ea39caaa791b5e5f092953057cb8ffb094a82/training/pango_font_info.cpp#L356 but if pango is compiled with win32 support, you get the win32 font map first:
#if defined(HAVE_CAIRO_WIN32)
  if (!backend || 0 == strcmp (backend, "win32"))
    return g_object_new (PANGO_TYPE_CAIRO_WIN32_FONT_MAP, NULL);
#endif
#if defined(HAVE_CAIRO_FREETYPE)
  if (!backend || 0 == strcmp (backend, "fc")
           || 0 == strcmp (backend, "fontconfig"))
    return g_object_new (PANGO_TYPE_CAIRO_FC_FONT_MAP, NULL);
#endif 

and nasty crashes follow because of the wrong reinterpret cast.

Fast Solution: specify fc backend

Solution: a simple patch will follow that fixes the behaviour for, at least, the most important functionality.

  2. If fontconfig is linked as a DLL, putenv does not get propagated to fontconfig: https://github.com/tesseract-ocr/tesseract/blob/ba2ea39caaa791b5e5f092953057cb8ffb094a82/training/pango_font_info.cpp#L151

Solution: specify it as an environment variable

  3. You cannot use disk paths (e.g., c:) in FONTCONFIG_PATH because fontconfig strips slashes from the path (FcStrCanonAbsoluteFilename) and then uses

GetFullPathNameW (dirname, 0, NULL, NULL)

without the slash and that function, interestingly, behaves like this

If a file name begins with only a disk designator but not the backslash after the colon, it is interpreted as a relative path to the current directory on the drive with the specified letter.

Solution: specify a sane directory
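To make point 3 concrete, here is a rough sketch of a check that could be run on a FONTCONFIG_PATH value before using it. It strips trailing separators, imitating the behaviour described above (this is not fontconfig's actual code), and reports whether what is left is only a drive designator such as c:, which GetFullPathNameW would then treat as relative to the current directory on that drive. The function name IsBareDriveAfterStrip is made up for illustration.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of fontconfig or Tesseract): returns true if
// `path`, once trailing '/' and '\\' characters are removed, consists only of
// a drive designator like "c:".
bool IsBareDriveAfterStrip(std::string path) {
  // Drop trailing separators, as the issue says fontconfig does.
  while (!path.empty() && (path.back() == '/' || path.back() == '\\')) {
    path.pop_back();
  }
  // A bare drive designator is exactly one letter followed by a colon.
  return path.size() == 2 &&
         std::isalpha(static_cast<unsigned char>(path[0])) != 0 &&
         path[1] == ':';
}
```

With this check, "c:\" and "C:/" are flagged as problematic, while "c:\fonts" is not.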

vidiecan avatar Aug 05 '16 17:08 vidiecan

None of these issues has been solved.

At least the first one probably affects Tesseract running in MinGW and Mac.

Fast Solution: specify fc backend

https://github.com/GNOME/pango/blob/master/pango/pangocairo-fontmap.c#L48

Something like this should be put in text2image.cpp:

#ifdef _WIN32
  putenv("PANGOCAIRO_BACKEND=fc");
#else
  setenv("PANGOCAIRO_BACKEND", "fc", 1);
#endif // _WIN32

Should be tested on Mac and MinGW before committing this code. This issue does not affect Linux.
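For anyone who wants to try this, the same idea can be wrapped in a small self-contained helper. This is only a sketch: the name ForceFontconfigBackend is hypothetical and does not exist in Tesseract, and it assumes the call happens before Pango creates its first font map. It uses _putenv on Windows and setenv elsewhere. Given point 2 of the original report, a value set this way might not reach a DLL that has its own copy of the environment, so it may still need to be set outside the process.

```cpp
#include <cstdlib>

// Hypothetical helper (not in Tesseract): ask PangoCairo to use the
// fontconfig ("fc") backend by setting PANGOCAIRO_BACKEND.
// Assumed to be called before any Pango font map is created.
// Returns true if the variable was set successfully.
bool ForceFontconfigBackend() {
#ifdef _WIN32
  // Windows CRT: takes "NAME=value", returns 0 on success.
  return _putenv("PANGOCAIRO_BACKEND=fc") == 0;
#else
  // POSIX: the final argument (1) overwrites any existing value.
  return setenv("PANGOCAIRO_BACKEND", "fc", 1) == 0;
#endif  // _WIN32
}
```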

amitdo avatar Sep 07 '16 16:09 amitdo

I think this issue should be reopened.

amitdo avatar Nov 07 '16 11:11 amitdo

AFAIK vidiecan is using VS. Is there any report from mingw users?

zdenop avatar Nov 07 '16 14:11 zdenop

He fixed number (1) in his list in one place in the code. That piece of code did cause a crash on Windows+VS, MinGW(64) and Mac. There is another similar piece of code that will probably cause a crash in some situations on all these platforms. I suggested a solution above, but it is useless to test it on Linux.

amitdo avatar Nov 07 '16 14:11 amitdo

Here is the problematic line: https://github.com/tesseract-ocr/tesseract/blob/182ca5bc1e/training/pango_font_info.cpp#L367

You need to use text2image with the flag only_extract_font_properties to trigger the function in which this code lives.

amitdo avatar Nov 07 '16 20:11 amitdo

The dotted_circle changes in #381 caused problems (in Linux at least). See: https://github.com/tesseract-ocr/tesseract/blob/5bb97f966885/training/pango_font_info.cpp#L438

amitdo avatar Nov 23 '16 12:11 amitdo

@vidiecan : Are points 2. and 3. still valid? If yes, do you have PR for it?

zdenop avatar Oct 20 '19 15:10 zdenop

The relevant code was rewritten in Tesseract 5.0.

@stweil,

Do you know if all the issues that were mentioned by the OP were solved?

amitdo avatar Sep 09 '21 07:09 amitdo

No, I don't know that and would have to run tests first. @vidiecan, did you test with the latest installer for Windows?

stweil avatar Sep 09 '21 07:09 stweil