Add support for Unicode filenames on MS Windows

Open stweil opened this issue 2 years ago • 10 comments

Tesseract currently has problems when the path of the executable contains Unicode characters which are not supported by the current code page.

I also expect problems for any filenames given to Tesseract (for example image names) which include such characters.

See pull request #3708 which triggered this issue.

stweil avatar Jan 04 '22 16:01 stweil

I just tested tesseract on Windows 10 with a path which contains Chinese characters.

Normally argv[0] which is passed as an argument to function main contains tesseract.exe with the full path. This works as long as all characters from the path are included in the code page. Chinese characters are not in that code page, and obviously they are replaced by ? characters.

So tesseract could simply check for any ? in argv[0] and abort with an error message if one is found.

That would not add support for Unicode paths but at least avoid some problems.

stweil avatar Jan 04 '22 16:01 stweil

Generally I am not sure whether it is worth supporting that feature. It is easy to avoid filenames and paths with problematic Unicode characters.

stweil avatar Jan 04 '22 16:01 stweil

https://github.com/DanBloomberg/leptonica/issues/537#issuecomment-691714238

amitdo avatar Jan 04 '22 21:01 amitdo

Regarding Leptonica: maybe Tesseract can use pixReadStream, which accepts FILE*.
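A minimal sketch of that idea, assuming the caller already has the path as UTF-8: open the file through the wide-character API on Windows and hand the resulting FILE* to the image reader. `OpenUtf8` is a hypothetical helper, not part of Tesseract or Leptonica, and the `pixReadStream` call is only indicated in a comment.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (not part of Tesseract or Leptonica): open a file whose
// path is given as UTF-8, independent of the active Windows code page.
std::FILE* OpenUtf8(const std::string& path, const std::string& mode) {
#ifdef _WIN32
  // Convert UTF-8 to UTF-16 so that the wide-character CRT function can be used.
  auto widen = [](const std::string& s) {
    std::wstring w;
    if (s.empty()) return w;
    const int len = static_cast<int>(s.size());
    const int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), len, nullptr, 0);
    w.resize(n);
    MultiByteToWideChar(CP_UTF8, 0, s.data(), len, &w[0], n);
    return w;
  };
  return _wfopen(widen(path).c_str(), widen(mode).c_str());
#else
  // Assumption: on other systems the narrow path is passed through unchanged.
  return std::fopen(path.c_str(), mode.c_str());
#endif
}

// Possible use with Leptonica (not compiled here):
//   std::FILE* fp = OpenUtf8(image_path, "rb");
//   Pix* pix = fp ? pixReadStream(fp, 0) : nullptr;
//   if (fp) std::fclose(fp);
```

This only covers the image input path. Other libraries that open files by name would still need the same treatment.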

danpla avatar Jan 04 '22 21:01 danpla

@stweil

> Normally argv[0] which is passed as an argument to function main contains tesseract.exe with the full path. This works as long as all characters from the path are included in the code page. Chinese characters are not in that code page, and obviously they are replaced by ? characters.
>
> So tesseract could simply check for any ? in argv[0] and abort with an error message if one is found.
>
> That would not add support for Unicode paths but at least avoid some problems.

Forget it. This is not reliable. ? is the usual substitution character when decoding, encoding, or recoding to Latin-based 8-bit encodings, and it can mean any of: invalid encoding, broken encoding, character not assigned, or glyph not in font. With other combinations of encoding and decoding, string manipulation by the software involved, fonts (the .notdef glyph), and renderers (which sometimes ignore the font), you can see other symbols instead.

If you want to waste your time, or really must analyse the problem, only a hexdump of the string shows what it really is. Just bail out or die with a message if the file could not be found or opened. Users with this problem should asciify or slugify their filenames to [a-z0-9._+-]: yes, no spaces, no uppercase, no "special" characters, no punctuation beyond what is in that set, no escaping.
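Both suggestions can be sketched in a few lines, assuming the string is available as raw bytes. `HexDump` and `IsPortableName` are hypothetical helper names chosen for illustration, not existing Tesseract functions.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte as two hex digits so the real
// encoding is visible regardless of font, terminal, or code page.
std::string HexDump(const std::string& s) {
  std::string out;
  char buf[4];
  for (unsigned char c : s) {
    std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
    if (!out.empty()) out += ' ';
    out += buf;
  }
  return out;
}

// Hypothetical helper: true if the name is non-empty and uses only the
// characters [a-z0-9._+-].
bool IsPortableName(const std::string& name) {
  if (name.empty()) return false;
  for (unsigned char c : name) {
    const bool ok = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') ||
                    c == '.' || c == '_' || c == '+' || c == '-';
    if (!ok) return false;
  }
  return true;
}
```

For example, `HexDump("a?")` yields `61 3f`, which shows a literal question mark (0x3f) rather than a rendering problem, and `IsPortableName("My Scan.png")` is false because of the space and the uppercase letters.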

wollmers avatar Jan 05 '22 14:01 wollmers

I agree. It would not be sufficient to upgrade the Tesseract code for full Unicode support on Windows, because all libraries (Leptonica, graphic libraries, libarchive, ...) have the same problem.

Windows is simply a nightmare regarding standards support. And as you said, Tesseract will report if it cannot find or open a file, and it is easy for users to avoid the problem.

stweil avatar Jan 05 '22 15:01 stweil

Use UTF-8 code pages in Windows apps

Regarding older Windows versions: Windows 7 has been EOL since January 2020. Windows 8.1 has a small market share (compared to other Windows versions) and will reach EOL in January 2023.
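Whether a process actually runs with the UTF-8 code page can be checked at runtime. The following is a minimal sketch, assuming the Win32 `GetACP` function is available. `NarrowPathsAreUtf8` is a hypothetical helper name, and the non-Windows branch simply assumes UTF-8.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: reports whether narrow (char*) strings such as argv
// and file paths are interpreted as UTF-8 by the ANSI Windows APIs.
bool NarrowPathsAreUtf8() {
#ifdef _WIN32
  // CP_UTF8 is code page 65001. It becomes the active code page when the
  // executable opts in (or the system-wide setting is enabled) on a
  // sufficiently recent Windows 10 or 11.
  return GetACP() == CP_UTF8;
#else
  // Assumption: other platforms are treated as UTF-8 here.
  return true;
#endif
}
```

A program could call this once at startup and print a warning when it returns false, instead of guessing from the contents of argv[0].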

amitdo avatar Feb 09 '22 18:02 amitdo

https://github.com/tesseract-ocr/tesseract/pull/3708#issuecomment-1162338217

amitdo avatar Jun 23 '22 18:06 amitdo

Should we add a manifest file for Unicode support on Windows 10/11?

amitdo avatar Dec 06 '23 09:12 amitdo

Yes, I think so. But then we also need build rules which add the manifest to tesseract.exe (and all other executables?). And that build rule should not depend on Microsoft's mt.exe, but use some alternative which also works in cross builds running on Linux.

stweil avatar Dec 06 '23 17:12 stweil