
tesseract only processing first page of each tif when given text file with list of multipage tifs as input

Open Shreeshrii opened this issue 2 years ago • 2 comments

tesseract -v
tesseract 5.0.0-18-g771c1
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found NEON
 Found OpenMP 201511
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
 Found libcurl/7.58.0 NSS/3.35 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3

I was trying to OCR several multipage TIFFs by passing a text file that lists the images as input. Tesseract processes only the first page of each image when such a list is given.

Shreeshrii, Dec 28 '21 04:12

Versions used on Debian 10.9:

  1. tesseract 4.0.0 (*)
  2. leptonica-1.76.0
  3. libgif 5.1.4
  4. libjpeg 6b (libjpeg-turbo 1.5.2)
  5. libpng 1.6.36
  6. libtiff 4.1.0
  7. zlib 1.2.11
  8. libwebp 0.6.1
  9. libopenjp2 2.3.0

We saw this with several multipage TIFFs passed on the command line, and then could not reproduce it. The command line looked like this:

    /usr/bin/tesseract -l "deu" "/path/to/input.tif" "/path/to/target.hocr" hocr 1>>logfile.log 2>&1

The weird thing is that I cannot reproduce it with the same TIFF. The failed run did write an hOCR file for the first page. That file was not truncated or otherwise corrupted; it was a complete and well-formed hOCR file. The TIFF has 86 pages, and whenever I try again, tesseract (same version, no system updates or anything like that in between) processes all 86 pages. A diff -u between the old and the new output shows only that the later results reference 85 more pages; the first page is identical.

There was nothing written to the log but the tesseract version and the processing of the first page. No error, no disturbance. Plenty of disk space and memory were also free.

So the issue seems to be random and unrelated to the form of the input, other than that multipage TIFFs are involved. (*) As the OP uses tesseract 5, I daresay this issue with multipage TIFFs is either related to libtiff (they have 4.0.9, we have 4.1.0) or has been present since at least tesseract 4.0.0.

Yamakuzure, Jul 08 '22 08:07

@Shreeshrii: The problem is that you are misusing tesseract's text file list feature. It is designed to replace a multipage TIFF with a list of single-page images, and it should not be used with multipage TIFFs (or with another text file list). Tesseract is an OCR engine, and you need to do image preprocessing (e.g. splitting a multipage TIFF or appending/prepending other images) yourself, before the OCR process.

zdenop, Jul 09 '22 09:07
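
For anyone landing here, the preprocessing suggested in the last comment could look roughly like the sketch below: split each multipage TIFF into single-page files, write those paths to a list file (one per line), and pass the list to tesseract. This is only an illustration, not something posted in the thread. It assumes Pillow is installed for the splitting step, and the function names, the `_p0001` file naming, and the default `deu` language are made up for the example.

```python
import subprocess
from pathlib import Path


def split_multipage_tiff(tiff_path, out_dir):
    """Write each page of a multipage TIFF as its own single-page TIFF."""
    # Assumption: Pillow (third-party) is available; imported lazily so the
    # other helpers work without it.
    from PIL import Image, ImageSequence

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(tiff_path).stem
    pages = []
    with Image.open(tiff_path) as img:
        for index, page in enumerate(ImageSequence.Iterator(img), start=1):
            # Hypothetical naming scheme: <stem>_p0001.tif, <stem>_p0002.tif, ...
            page_path = out_dir / f"{stem}_p{index:04d}.tif"
            page.save(page_path)  # saves only the current frame
            pages.append(page_path)
    return pages


def write_list_file(page_paths, list_path):
    """Write one image path per line, for use as tesseract's list input."""
    list_path = Path(list_path)
    list_path.write_text("".join(f"{p}\n" for p in page_paths), encoding="utf-8")
    return list_path


def run_tesseract(list_path, output_base, lang="deu"):
    """Run tesseract on the list file and request hOCR output."""
    cmd = ["tesseract", str(list_path), str(output_base), "-l", lang, "hocr"]
    subprocess.run(cmd, check=True)
```

A typical call sequence would be `pages = split_multipage_tiff("input.tif", "pages")`, then `run_tesseract(write_list_file(pages, "pages.txt"), "target")`. Note that two input files with the same base name would overwrite each other's pages in a shared output directory, so use one directory per input file in that case.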