Summary

Please confirm whether multi-page TIFF files are supported, perhaps through an option I could not find, or whether this would require an enhancement.
When I extract text from a multi-page TIFF using Text(), only the text from the first page is returned.
When I run the tesseract command-line client with default settings on the same file, it extracts the text from all pages.
I looked at the tesseract source code around its call to pixReadMem() and noticed this:
- https://tesseract-ocr.github.io/3.x/a00680_source.html#l01173
It looks like tesseract might do some additional preprocessing on the image before calling pixReadMem().

Reproducibility

Reproducibility Frequency

100%

How to reproduce

Get a multi-page TIFF file.
Run tesseract 3+ on it from the command line like so:

tesseract multipage.tif multipage.tif

Examine the output (multipage.tif.txt) and notice that text has been extracted from all pages of the TIFF.
Next, set up a gosseract client and set the image using either SetImage() or SetImageFromBytes() on a multi-page .tif file. Extract the text using Text().

client := gosseract.NewClient()
defer client.Close()

// Alternatively: client.SetImageFromBytes(*imgBytes)
client.SetImage("multipage.tif")
text, err := client.Text()
if err != nil {
	log.Fatal(err)
}
fmt.Println(text)

Examine the value of text. Notice that only the first page's text is returned.

Environment

uname -a
Darwin <<removed>> 17.7.0 Darwin Kernel Version 17.7.0 x86_64

go env
GOARCH="amd64"
GOBIN=""
GOCACHE="<<removed>>"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="<<removed>>"
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=<<removed>>"

go version
go1.10.3 darwin/amd64

tesseract --version
tesseract 3.05.02
 leptonica-1.76.0
  libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11

Sep 14 '18 19:09 evu

@evu Thanks. Could you give me a multi-page TIFF file as an example for development?

Oct 23 '18 03:10 otiai10

ping @evu

Nov 03 '18 17:11 otiai10

http://www.nightprogrammer.org/wp-uploads/2013/02/multipage_tiff_example.tif

Nov 05 '18 13:11 evu

thx

Nov 05 '18 13:11 otiai10

https://github.com/tesseract-ocr/tesseract/wiki/APIExample
https://tesseract-ocr.github.io/3.x/index.html
- https://tesseract-ocr.github.io/3.x/a01281.html#ga551aa98cd0a9957195f83729a599a89f
https://github.com/tesseract-ocr/tesseract/issues/233
https://www.google.co.jp/search?q=tesseract-ocr+multi+page+tiff+c%2B%2B+api&oq=tesseract-ocr+multi+page+tiff+c%2B%2B+api&aqs=chrome..69i57.13967j0j7&sourceid=chrome&ie=UTF-8
- https://github.com/tesseract-ocr/tesseract/issues/1138#issuecomment-330278261
- https://stackoverflow.com/questions/22691377/read-tiff-image-tesseract-and-leptonica
- https://stackoverflow.com/questions/46283216/how-to-get-text-for-multi-page-tiff-using-tesseract-capi?rq=1

Nov 05 '18 16:11 otiai10

I'm having the same problem: when I try to extract text from a multi-page .tiff file, only the first page is extracted. The same problem also exists in the case of a .png file. Would appreciate any help :)

Nov 29 '21 06:11 filip-dahlberg

Request for info: support for multi-page tiffs
