tess4j
Lost support for Tesseract 5.3.4
Thanks for the new release! However, it seems to break on Ubuntu where the latest Tesseract version available is 5.3.4:
java.lang.UnsatisfiedLinkError: Unable to load library 'leptonica':
libleptonica.so: cannot open shared object file: No such file or directory
The stack trace is:
at com.sun.jna.NativeLibrary.loadLibrary(NativeLibrary.java:325)
at com.sun.jna.NativeLibrary.getInstance(NativeLibrary.java:481)
at com.sun.jna.Native.register(Native.java:1770)
at com.sun.jna.Native.register(Native.java:1489)
at net.sourceforge.lept4j.Leptonica1.<clinit>(Leptonica1.java:41)
at net.sourceforge.lept4j.util.LeptUtils.convertImageToPix(LeptUtils.java:104)
at net.sourceforge.lept4j.util.LeptUtils.convertImageToPix(LeptUtils.java:92)
at net.sourceforge.tess4j.Tesseract.setImage(Tesseract.java:391)
at net.sourceforge.tess4j.Tesseract.getWords(Tesseract.java:706)
at net.sourceforge.tess4j.ITesseract.getWords(ITesseract.java:275)
...
I'm not sure you want to fix that and would totally understand if you plan on supporting only the latest Tesseract.
libleptonica.so, or a symbolic link to it, is created during installation of tesseract. Can you verify its existence to make sure tesseract has been properly installed?
The package to install on Ubuntu is tesseract-ocr-all. It depends on tesseract-ocr, which in turn depends on liblept5. Ubuntu also has a libleptonica-dev package, but it is not a dependency of tesseract, and installing it doesn't seem to change much.
Installation of Tesseract and its dependency Leptonica should create a libleptonica.so symbolic link, pointing to the actual liblept.so.x.x.x file, in the system path. If it did not, you may have to manually create it.
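To verify which of the two names is actually present, a small check can list the Leptonica library files and show where any symbolic links point. This is a minimal sketch using only the JDK; the class name LeptonicaLinkCheck and the directory list are assumptions for illustration, so adjust the directories for your distribution.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of tess4j or lept4j.
class LeptonicaLinkCheck {

    // Returns the entries in dir whose file name starts with one of the prefixes.
    static List<Path> findCandidates(Path dir, String... prefixes) throws IOException {
        List<Path> found = new ArrayList<>();
        if (!Files.isDirectory(dir)) {
            return found; // directory missing on this system: nothing to report
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                for (String prefix : prefixes) {
                    if (name.startsWith(prefix)) {
                        found.add(entry);
                        break;
                    }
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Assumed library directories; not an exhaustive list.
        String[] dirs = {"/usr/lib/x86_64-linux-gnu", "/usr/lib", "/usr/local/lib"};
        for (String d : dirs) {
            for (Path p : findCandidates(Paths.get(d), "libleptonica.so", "liblept.so")) {
                // Show the link target so a dangling or missing alias is easy to spot.
                String target = Files.isSymbolicLink(p) ? " -> " + Files.readSymbolicLink(p) : "";
                System.out.println(p + target);
            }
        }
    }
}
```

If the output shows only liblept.so.* entries and no libleptonica.so*, the link described above is missing.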
I have the same issue on macOS with tess4j 5.15.0.
@ABHammad The program looks for a libtesseract.dylib to load. So make sure it exists and is in the system path.
https://sourceforge.net/p/tess4j/discussion/1202294/thread/4ec2dbbe87/ https://sourceforge.net/p/tess4j/discussion/1202294/thread/c9bc74d9/ https://stackoverflow.com/questions/21394537/tess4j-unsatisfied-link-error-on-mac-os-x
I already did that, and it was working with versions 5.13 and 5.14, but 5.15 can't find the "leptonica" library.
I can confirm this as well. 5.14 works, but 5.15 does not work within Docker. The Dockerfile is basically:
FROM ubuntu:24.04
RUN apt-get update
RUN apt-get install tesseract-ocr-all -y
RUN apt-get install openjdk-21-jdk -y
...
ENTRYPOINT ["java", "-jar", "/usr/local/ocr.jar"]
Also, it does not work with ubuntu:25.04, where it fails with "libleptonica.so.6: undefined symbol: readResolutionMemJp2k", but I assume that is not a problem with this wrapper.
I guess the reason is the update of lept4j from 1.20.0 to 1.21.0 https://github.com/nguyenq/tess4j/commit/648d2a27d9702c4ee832829bf791ad79088e1a05#diff-9c5fb3d1b7e3b0f54bc5c4182965c4fe1f9023d449017cece3005d3f90e8e4d8L228 which updates net.java.dev.jna from 5.15.0 to 5.16.0.
There is a newer 5.17.0 available, though. I wonder if @matthiasblaesing knows whether something changed from 5.15.0 to 5.16.0 in JNA.
Tested on Ubuntu and found problems in Tess4J:
- The library name in tess4j does not match the library name in the distribution. Ubuntu 24.10 comes with liblept5, where I get:
matthias@enterprise:~/tmp/leptonica/lib/shared$ ls -lh /usr/lib/x86_64-linux-gnu/liblept*
lrwxrwxrwx 1 root root 16 Mär 31 2024 /usr/lib/x86_64-linux-gnu/liblept.so.5 -> liblept.so.5.0.4
-rw-r--r-- 1 root root 2,6M Mär 31 2024 /usr/lib/x86_64-linux-gnu/liblept.so.5.0.4
matthias@enterprise:~/tmp/leptonica/lib/shared$
building from source for leptonica yields:
matthias@enterprise:~/tmp/leptonica/lib/shared$ ls -lh
insgesamt 3,1M
lrwxrwxrwx 1 matthias matthias 12 Apr 12 20:33 liblept.so -> liblept.so.1
lrwxrwxrwx 1 matthias matthias 17 Apr 12 20:33 liblept.so.1 -> liblept.so.1.85.1
-rwxrwxr-x 1 matthias matthias 3,1M Apr 12 20:33 liblept.so.1.85.1
matthias@enterprise:~/tmp/leptonica/lib/shared$
This declaration is just broken:
https://github.com/nguyenq/lept4j/blob/4f762f91ba95e702d8f7e7e3acd48ca5412bf063/src/main/java/net/sourceforge/lept4j/util/LoadLibs.java#L54
That should be
public static final String LIB_NAME_NON_WIN = "lept";
You can work around this by making the library available under the wrong name:
sudo ln -s /usr/lib/x86_64-linux-gnu/liblept.so.5 /usr/lib/x86_64-linux-gnu/libleptonica.so.5
- The leptonica bindings bind functions that are not present in the library on Ubuntu. When I run it on 24.10 I get:
Exception in thread "main" java.lang.UnsatisfiedLinkError: Error looking up function 'pixAddMultipleBlackWhiteBorders': /lib/x86_64-linux-gnu/liblept.so.5: undefined symbol: pixAddMultipleBlackWhiteBorders
I built leptonica from source and pointed jna.library.path to that build (and applied the naming hack), and then Tess4J works.
If you don't need the function and only bound it for completeness, you can drop the direct mapping and use the interface mapping. Then you can do this on Ubuntu 24.10:
package eu.doppelhelix.test.tesstest;

import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Pointer;
import net.sourceforge.lept4j.Pix;

public class TessTest {
    public static void main(String[] args) {
        // Interface mapping: functions are looked up lazily, on first call.
        Leptonica instance = Native.load("lept", Leptonica.class);
        System.out.println(instance.getLeptonicaVersion().getString(0));
        // System.out.println(instance.pixAddMultipleBlackWhiteBorders(null, 0, 0, 0, 0, 0, 0));
    }

    static interface Leptonica extends Library {
        public Pointer getLeptonicaVersion();
        // Missing from liblept.so.5 on Ubuntu 24.10; only fails when actually called.
        public Pix pixAddMultipleBlackWhiteBorders(Pix pixs, int nblack1, int nwhite1, int nblack2, int nwhite2, int nblack3, int nwhite3);
    }
}
This gives me:
leptonica-1.82.0
If you uncomment the call to pixAddMultipleBlackWhiteBorders, you'll get:
Exception in thread "main" java.lang.UnsatisfiedLinkError: Error looking up function 'pixAddMultipleBlackWhiteBorders': /usr/lib/x86_64-linux-gnu/liblept.so.5.0.4: undefined symbol: pixAddMultipleBlackWhiteBorders
at com.sun.jna.Function.<init>(Function.java:255)
at com.sun.jna.NativeLibrary.getFunction(NativeLibrary.java:618)
at com.sun.jna.NativeLibrary.getFunction(NativeLibrary.java:594)
at com.sun.jna.NativeLibrary.getFunction(NativeLibrary.java:580)
at com.sun.jna.Library$Handler.invoke(Library.java:248)
at eu.doppelhelix.test.tesstest.$Proxy0.pixAddMultipleBlackWhiteBorders(Unknown Source)
at eu.doppelhelix.test.tesstest.TessTest.main(TessTest.java:12)
Last time I tried, installing Tesseract also installed its Leptonica dependency, which is accessed at runtime through the libleptonica.so symbolic link. If newer installations changed the Leptonica library name to liblept.so, then tess4j's lept4j dependency would not be able to find it. A simple workaround may be to create a libleptonica.so link pointing to liblept.so.
If you encounter an undefined symbol error (for example, readResolutionMemJp2k or pixAddMultipleBlackWhiteBorders), it is because the lept4j version you use does not match the installed Leptonica version, and the installed library does not provide those newer functions. Make sure you use a compatible lept4j version by specifying it explicitly in the pom.xml file.
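For setups where writing to the system library directory is not an option, the same link can be created in a user-writable directory and that directory passed to JNA through jna.library.path, the property mentioned earlier in this thread. The following is a minimal sketch using only the JDK; the class name LeptonicaAlias, the ~/.lept4j-libs directory, and the liblept.so.5 path are assumptions for illustration, and whether lept4j picks up the link this way on your system should be verified.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of tess4j or lept4j.
class LeptonicaAlias {

    // Creates linkDir/libleptonica.so pointing at realLib, unless the link already exists.
    static Path createAlias(Path realLib, Path linkDir) throws IOException {
        Files.createDirectories(linkDir);
        Path link = linkDir.resolve("libleptonica.so");
        // NOFOLLOW_LINKS so that an existing (even dangling) link is detected.
        if (!Files.exists(link, LinkOption.NOFOLLOW_LINKS)) {
            Files.createSymbolicLink(link, realLib.toAbsolutePath());
        }
        return link;
    }

    public static void main(String[] args) throws IOException {
        // Assumed location of the distribution's library (Ubuntu, x86_64).
        Path realLib = Paths.get("/usr/lib/x86_64-linux-gnu/liblept.so.5");
        // Hypothetical per-user directory for the alias.
        Path linkDir = Paths.get(System.getProperty("user.home"), ".lept4j-libs");
        Path link = createAlias(realLib, linkDir);
        // Must be set before any class that loads the native library is initialized.
        System.setProperty("jna.library.path", linkDir.toString());
        System.out.println(link + " -> " + Files.readSymbolicLink(link));
    }
}
```

The property can also be passed on the command line with -Djna.library.path=... instead of being set in code.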
As part of our investigation of the reported issues on Ubuntu, we tried to get VietOCR, a GUI client for Tesseract, to work in WSL for Ubuntu 24.04.2 LTS. What we observed is that it is not a straightforward process. We will document these findings on a separate permanent page. We hope this helps users and developers resolve issues with getting tess4j to play nice with their tesseract installation.
$ sudo apt-get install tesseract-ocr tesseract-ocr-eng
$ tesseract -v
tesseract 5.5.0-48-gf96c
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Installation of the language packs placed the *.traineddata files in the /usr/share/tesseract-ocr/5/tessdata directory.
$ tesseract --list-langs
List of available languages in "/usr/local/share/" (0).
That means the language pack installation did not put the data files in the directory that the tesseract program expects.
$ sudo cp /usr/share/tesseract-ocr/5/tessdata/*.traineddata /usr/local/share/tessdata/
A client program might not be aware that the language data files are located in /usr/local/share/tessdata/. It may rely on the TESSDATA_PREFIX environment variable to tell it exactly where to find the files. We need to set and export it.
Open the ~/.profile file and add at the bottom:
export TESSDATA_PREFIX=/usr/local/share/
Restart Ubuntu for the change to take effect.
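Before starting the client, it can help to confirm from Java that the variable is visible to the JVM and that language files are actually found. This is a minimal sketch using only the JDK; the class name TessdataCheck is an assumption, and because it is not certain whether a given client treats TESSDATA_PREFIX as the tessdata directory itself or as its parent, the sketch simply reports both.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of tess4j or VietOCR.
class TessdataCheck {

    private static final String SUFFIX = ".traineddata";

    // Returns the sorted language codes (file names without suffix) found in dir.
    static List<String> listLanguages(Path dir) throws IOException {
        List<String> langs = new ArrayList<>();
        if (!Files.isDirectory(dir)) {
            return langs;
        }
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*" + SUFFIX)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                langs.add(name.substring(0, name.length() - SUFFIX.length()));
            }
        }
        Collections.sort(langs);
        return langs;
    }

    public static void main(String[] args) throws IOException {
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix == null) {
            System.out.println("TESSDATA_PREFIX is not set in this environment");
            return;
        }
        // Report both interpretations of the prefix.
        Path asGiven = Paths.get(prefix);
        Path withTessdata = asGiven.resolve("tessdata");
        System.out.println(asGiven + ": " + listLanguages(asGiven));
        System.out.println(withTessdata + ": " + listLanguages(withTessdata));
    }
}
```

An empty list for both directories suggests the data files were not copied, or that the variable was not exported to the process that launches Java.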
OCR from VietOCR generated an exception:
java.lang.UnsatisfiedLinkError: Error looking up function 'pixBackgroundNormTo1MinMax': /lib/x86_64-linux-gnu/liblept.so.5: undefined symbol: pixBackgroundNormTo1MinMax
The error stems from the fact that the currently installed leptonica-1.82.0 does not have support for the API method pixBackgroundNormTo1MinMax. VietOCR bundles lept4j-1.21.0 as a dependency, which is a Java binding for leptonica-1.85.0. We need to download and use a lept4j version that is compatible with leptonica-1.82.0; that version would be lept4j-1.16.x.
For client applications that have a pom.xml, you'd need to explicitly specify the appropriate lept4j version compatible with your leptonica installation.
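To pick the version to pin, the Leptonica version can be read from the output of `tesseract -v` and matched against known pairings. This is a minimal sketch using only the JDK; the class name LeptVersionAdvisor is an assumption, the table contains only the two pairings stated in this thread, and matching on major.minor alone (so that, for example, 1.85.1 is treated like 1.85.0) is also an assumption. Check the lept4j release notes for anything else.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of tess4j or lept4j.
class LeptVersionAdvisor {

    // Matches lines such as "leptonica-1.82.0" in the output of `tesseract -v`.
    private static final Pattern LEPTONICA = Pattern.compile("leptonica-(\\d+)\\.(\\d+)\\.(\\d+)");

    // Leptonica major.minor -> lept4j version; only pairings mentioned in this thread.
    private static final Map<String, String> KNOWN = new LinkedHashMap<>();
    static {
        KNOWN.put("1.82", "1.16.x");
        KNOWN.put("1.85", "1.21.0");
    }

    // Extracts "major.minor" of Leptonica from the version output, if present.
    static Optional<String> leptonicaMajorMinor(String versionOutput) {
        Matcher m = LEPTONICA.matcher(versionOutput);
        return m.find() ? Optional.of(m.group(1) + "." + m.group(2)) : Optional.empty();
    }

    // Suggests a lept4j version, or empty when the Leptonica version is not in the table.
    static Optional<String> suggestLept4j(String versionOutput) {
        return leptonicaMajorMinor(versionOutput).map(KNOWN::get);
    }

    public static void main(String[] args) {
        String output = "tesseract 5.5.0-48-gf96c\n leptonica-1.82.0\n";
        System.out.println(suggestLept4j(output).orElse("unknown, check the lept4j release notes"));
    }
}
```

The suggested version is then what goes into the explicit lept4j dependency in pom.xml.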
We observed that the leptonica installation did create a libleptonica.so link pointing to liblept.so, which is itself a link that eventually resolves to the leptonica binary.