tess4j
Tess4j - Error opening tessdata file by non-ASCII path
OS: Windows 10, IDE: IntelliJ, tess4j: 4.5.1
I have two folders on my disk with identical 'eng.traineddata' files:
c:/data/eng.traineddata
c:/дата/eng.traineddata
Tesseract fails when running the following code:
Tesseract instance = new Tesseract();
// instance.setDatapath("c:/data"); // works without issues
instance.setDatapath("c:/дата"); // see Error message below
instance.setLanguage("eng");
String result = instance.doOCR(new File("c:/numbers.jpg"));
Error message:
Error opening data file c:/дата/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
The error is pretty clear: you can't have non-ASCII characters in the tessdata path. 'д' is not an ASCII character.
@nguyenq thanks for the feedback! Could you provide a bit more context here? For example, is the root cause on the Tesseract side or on the wrapper side, are there any workarounds available, and are there any plans to support non-ASCII paths?
It could be JNA or it could be inside Tesseract native code. On Linux, Tesseract and its tessdata directory are placed in standard system directories, so I doubt Tesseract code would ever need to deal with non-ASCII characters in those paths.
On Windows, you may want to try a relative path that contains no non-ASCII characters to see if it works.
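To try the relative-path suggestion, a small helper along these lines could work. This is only a sketch: `AsciiDatapath`, `isAscii`, and `relativeIfAscii` are hypothetical names, not part of tess4j, and whether a relative path actually avoids the failure is untested here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of tess4j.
final class AsciiDatapath {

    // True when every character of s is in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    // Returns dataDir expressed relative to workingDir if that relative
    // form is pure ASCII; returns null when it is not, or when the two
    // paths have different roots (e.g. different Windows drives).
    static String relativeIfAscii(String workingDir, String dataDir) {
        Path base = Paths.get(workingDir).toAbsolutePath().normalize();
        Path target = Paths.get(dataDir).toAbsolutePath().normalize();
        try {
            String candidate = base.relativize(target).toString();
            return isAscii(candidate) ? candidate : null;
        } catch (IllegalArgumentException differentRoots) {
            return null;
        }
    }
}
```

A non-null result could be passed to `instance.setDatapath(...)`, with `System.getProperty("user.dir")` as the working directory. Note that this only helps when the non-ASCII part sits in a directory shared by both paths; for `c:/дата` itself the relative form still contains the non-ASCII name.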
This may be related to issue https://github.com/nguyenq/tess4j/issues/75.
Failures may occur when non-ASCII characters appear in the source file name, the data file names, or the target file name. Meanwhile, the same file names work when the tesseract command is run through ProcessBuilder.
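For reference, the ProcessBuilder route mentioned above might look roughly like the following. It is a sketch under assumptions: the `tesseract` executable is on the PATH, the argument order follows the Tesseract 4 command-line usage (`imagename outputbase [options]`), and `TesseractCli`, `buildCommand`, and `runOcr` are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical wrapper around the tesseract command-line tool.
final class TesseractCli {

    // Builds the argument list; "stdout" as the output base makes
    // tesseract print the recognized text instead of writing a file.
    static List<String> buildCommand(String image, String dataDir, String lang) {
        return List.of("tesseract", image, "stdout",
                "--tessdata-dir", dataDir, "-l", lang);
    }

    // Runs tesseract and returns the text it printed to standard output.
    static String runOcr(String image, String dataDir, String lang)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(image, dataDir, lang))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // keep diagnostics out of the result
                .start();
        String text = new String(process.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }
}
```

With the paths from the original report, the call would be `TesseractCli.runOcr("c:/numbers.jpg", "c:/дата", "eng")`.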
You are right that the cause may be on the Java side, in how it handles file names when passing them to the local API. A JDK bug: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8205991