Request for info: support for multi-page tiffs
Summary
- Please confirm whether support for multi-page TIFF files is present, perhaps through an option I cannot identify, or whether this would require an enhancement.
- When I extract text from a multi-page TIFF using Text(), it only extracts the text from the first page of the TIFF.
- When I extract text from a multi-page TIFF using the tesseract command-line client with defaults, it extracts the text from all pages.
- I looked at some of the tesseract source code for pixReadMem() and noticed this:
  - https://tesseract-ocr.github.io/3.x/a00680_source.html#l01173
- It looks like tesseract might do some additional preprocessing on the image prior to calling pixReadMem().
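For background, a classic TIFF stores each image in its own image file directory (IFD), and the IFDs are chained by offsets, so a reader that stops after the first IFD sees only page one. The sketch below is a rough way to check how many pages a test file really has before comparing outputs. It uses only the Go standard library and is not part of gosseract or tesseract; countTIFFPages is a hypothetical helper name, and it assumes a classic (non-BigTIFF) file in which every IFD in the main chain is one page.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"os"
)

// countTIFFPages (hypothetical helper) walks the main IFD chain of a classic
// TIFF and returns the number of IFDs, assuming one IFD per page.
func countTIFFPages(data []byte) (int, error) {
	if len(data) < 8 {
		return 0, errors.New("too short to be a TIFF")
	}
	var bo binary.ByteOrder
	switch string(data[:2]) {
	case "II":
		bo = binary.LittleEndian
	case "MM":
		bo = binary.BigEndian
	default:
		return 0, errors.New("missing TIFF byte-order mark")
	}
	if bo.Uint16(data[2:4]) != 42 {
		return 0, errors.New("not a classic TIFF (magic number is not 42)")
	}
	size := uint64(len(data))
	seen := map[uint64]bool{} // guards against a looping IFD chain
	pages := 0
	off := uint64(bo.Uint32(data[4:8]))
	for off != 0 {
		if seen[off] {
			return 0, errors.New("IFD chain loops")
		}
		seen[off] = true
		if off+2 > size {
			return 0, errors.New("IFD offset out of range")
		}
		// Each IFD: 2-byte entry count, 12 bytes per entry, 4-byte next offset.
		entries := uint64(bo.Uint16(data[off : off+2]))
		next := off + 2 + 12*entries
		if next+4 > size {
			return 0, errors.New("IFD truncated")
		}
		pages++
		off = uint64(bo.Uint32(data[next : next+4]))
	}
	return pages, nil
}

func main() {
	// "multipage.tif" is the file name used in the reproduction steps.
	data, err := os.ReadFile("multipage.tif")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	n, err := countTIFFPages(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("pages:", n)
}
```

If this reports more than one page while Text() returns a single page of text, the input file itself is not the cause.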
Reproducibility
Reproducibility Frequency
- 100%
How to reproduce
- Get a multipage tiff file.
- Run tesseract 3+ on it from the command line like so:
tesseract multipage.tif multipage.tif
- Examine the output (multipage.tif.txt) and notice that text has been extracted from all pages of the tif.
- Next, set up a gosseract client and set the image using either SetImage() or SetImageFromBytes() on a multi-page .tif file. Extract text using Text().
client := gosseract.NewClient()
defer client.Close()
// client.SetImageFromBytes(*imgBytes)
client.SetImage("multipage.tif")
text, _ := client.Text()
fmt.Println(text)
- Examine the output in text. Notice that only the first page's text is returned.
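Since the command-line client reads every page, one possible stopgap is to call it from Go until the library handles multi-page files. This is only a sketch under assumptions: it assumes the tesseract binary is on PATH and relies on the CLI's special output base "stdout" to print the text instead of writing a .txt file. ocrAllPages and tesseractArgs are hypothetical helper names, not part of gosseract.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// tesseractArgs (hypothetical helper) builds the CLI arguments. The output
// base "stdout" makes tesseract print the recognized text to standard output.
func tesseractArgs(imagePath string) []string {
	return []string{imagePath, "stdout"}
}

// ocrAllPages (hypothetical helper) runs the tesseract CLI on imagePath and
// returns whatever text it prints. Assumes tesseract is installed and on PATH.
func ocrAllPages(imagePath string) (string, error) {
	out, err := exec.Command("tesseract", tesseractArgs(imagePath)...).Output()
	if err != nil {
		return "", fmt.Errorf("tesseract failed on %s: %w", imagePath, err)
	}
	return string(out), nil
}

func main() {
	text, err := ocrAllPages("multipage.tif")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Print(text)
}
```

Shelling out costs a process launch per file and bypasses any client settings made through gosseract, so it is a workaround rather than a fix.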
Environment
uname -a
Darwin <<removed>> 17.7.0 Darwin Kernel Version 17.7.0 x86_64
go env
GOARCH="amd64"
GOBIN=""
GOCACHE="<<removed>>"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="<<removed>>"
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=<<removed>>"
go version
go1.10.3 darwin/amd64
tesseract --version
tesseract 3.05.02
leptonica-1.76.0
libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11
@evu Thanks. Could you give me a multi-page TIFF file as an example for development?
ping @evu
http://www.nightprogrammer.org/wp-uploads/2013/02/multipage_tiff_example.tif
thx
- https://github.com/tesseract-ocr/tesseract/wiki/APIExample
- https://tesseract-ocr.github.io/3.x/index.html
- https://tesseract-ocr.github.io/3.x/a01281.html#ga551aa98cd0a9957195f83729a599a89f
- https://github.com/tesseract-ocr/tesseract/issues/233
- https://www.google.co.jp/search?q=tesseract-ocr+multi+page+tiff+c%2B%2B+api&oq=tesseract-ocr+multi+page+tiff+c%2B%2B+api&aqs=chrome..69i57.13967j0j7&sourceid=chrome&ie=UTF-8
- https://github.com/tesseract-ocr/tesseract/issues/1138#issuecomment-330278261
- https://stackoverflow.com/questions/22691377/read-tiff-image-tesseract-and-leptonica
- https://stackoverflow.com/questions/46283216/how-to-get-text-for-multi-page-tiff-using-tesseract-capi?rq=1
I'm having the same problem: when I try to extract text from a multi-page .tiff file, only the first page is extracted. The same problem also occurs with a .png file. I would appreciate any help :)