Check whether tesseract supports jpeg2000 or not
pytesseract.pytesseract.TesseractError: (1, 'Error in pixReadStreamJp2k: function not present Error in pixReadStream: jp2: no pix returned Error in pixRead: pix not read Error during processing.')
pytesseract 0.3.10 tesseract 5.1.0 pillow 9.0.1 openjpeg2 2.4.0 pytest 7.1.0 python 3.10.2
Old title: test_image_to_string_with_image_type[jpeg2000] failure with tesseract >4.1.x
This error is related to tesseract itself. Which version is that? Also, is there a sample image that causes the error?
Oh right: tesseract 5.1.0
The image used by the test: https://github.com/madmaze/pytesseract/blob/v0.3.10/tests/data/test.jpeg2000
Well, hmmm. CI on master passes, so I'm not sure what is going on there. PS: Yep, your tesseract version is new enough, and CI still uses 4.1.x.
At this point, I would check what changed in 5.1.0 that broke jpeg2000 support, because 4.x clearly works with jpeg2000. It might be the imaging library support in tesseract or something like that.
Have you tried using tesseract directly with the jpeg2000 image?
I haven't used tesseract yet. I only build pytesseract to provide it as an optional dependency for urlwatch in the Arch repos.
At the moment, I don't have an Arch instance with tesseract 5.1.0 around to test whether this is a pytesseract-related or a tesseract-specific issue. When I have time, I will try to boot up a container with that setup to check.
Same issue here. I debugged it, and in my case the root cause was determined as follows:
- tesseract 5.1.0 (on FreeBSD 13.0 amd64) failed to process a JPEG2000 file, because:
- tesseract uses leptonica for reading images; and
- leptonica was compiled without OPENJPEG option, omitting the libopenjp2 library
The remedy for me was to recompile leptonica with OpenJPEG 2.4.0 support.
However for py-pytesseract, it should skip the test if there are indications that tesseract does not support JPEG2000.
Thank you for investigating that @mandree - I am not sure if there is a nice way to ask tesseract if that is the case or not.
Sadly, pytesseract is designed as a thin wrapper around the tesseract executable and doesn't provide any real integration.
You can query tesseract with -v or --version apparently.
See the line right below leptonica: it mentions libopenjp2 (or not).
First two examples from FreeBSD 13.0 amd64, third and last example on Fedora 35 x86_64.
With JPEG2000 support:
$ tesseract -v
tesseract 5.1.0
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37+apng : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
Found OpenMP 201811
Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
Found libcurl/7.82.0 OpenSSL/1.1.1k zlib/1.2.11 libssh2/1.10.0 nghttp2/1.46.0
And without:
$ tesseract -v
tesseract 5.1.0
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37+apng : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2
Found OpenMP 201811
Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
Found libcurl/7.82.0 OpenSSL/1.1.1k zlib/1.2.11 libssh2/1.10.0 nghttp2/1.46.0
Fedora Linux:
$ tesseract -v
tesseract 4.1.3
leptonica-1.81.1
libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.0) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2
Found AVX2
Found AVX
Found FMA
Found SSE
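The `tesseract --version` check described above could be automated roughly as follows. This is a minimal sketch, not part of pytesseract: the function names `tesseract_version_output` and `has_jp2k_support` are hypothetical, and it assumes that the version output lists `libopenjp2` whenever leptonica was built with OpenJPEG, as in the examples above.

```python
import re
import subprocess


def tesseract_version_output(cmd: str = "tesseract") -> str:
    """Return the text printed by `<cmd> --version`, or "" if it cannot be run.

    stderr is merged into stdout because some tesseract releases print
    the version banner on stderr.
    """
    try:
        proc = subprocess.run(
            [cmd, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            check=False,
        )
    except OSError:
        # Executable missing or not runnable.
        return ""
    return proc.stdout


def has_jp2k_support(version_output: str) -> bool:
    """Heuristic (assumption): JPEG2000 is readable if libopenjp2 is listed."""
    return re.search(r"\blibopenjp2\b", version_output) is not None
```

A test suite could then guard the jpeg2000 case with something like `pytest.mark.skipif(not has_jp2k_support(tesseract_version_output()), reason="leptonica built without OpenJPEG")`. Since this only inspects the version banner, it remains a heuristic rather than a guarantee that reading a given JPEG2000 file will succeed.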