leptess
Multi-page support (TIFF)
Hey.
Most OCR work I've seen so far uses (b/w, CCITT compressed) multi-page documents. I'd like to make these work with leptess, but it seems (unless I'm missing something?) that there's only support for Pix (not PixA), and no mapping for direct TIFF I/O (say, pixaReadMultipageTiff from Leptonica). The high-level wrapper (leptess::LepTess) also doesn't expose a method to directly set_image a Pix, but that would be the most trivial thing to change.
In other words: I was hoping for a Rust (leptess) workflow that allows
- reading a multi-page TIFF as PixA
- iterating over each page -> Pix and collecting the recognition results
Is that something you'd be willing to support? Am I missing a way this would already work today? I could offer to look into this, but I admit that I'm a Rust beginner at this point in time.
Hey, I might be able to look at this, but it wouldn't be until next weekend.
I think this might be possible today using set_image_from_mem and the image crate, but I haven't tried it.
Some notes for myself: https://tpgit.github.io/Leptonica/pix_8h_source.html#l00363 https://github.com/DanBloomberg/leptonica/blob/5aaf1c187deeef7f47288c6b0833a07021940da7/src/tiffiostub.c#L99-L103
Thanks a ton for the reply. Looking at the linked image crate / into_bytes: it probably should NOT copy for this to be a decent workaround? Otherwise my naive understanding is that the image would be read once, then copied for each page (and anyway re-read by Leptonica).
Leptonica does provide the required functionality already, right? PixA is a collection of Pix (an "A"rray of Pix) that allows access to the individual entries, which could be passed to tess_api.set_image directly if that were exposed in the high-level LepTess. This is already what's happening in set_image_from_mem anyway: reading a buffer into a Pix, then handing that to Tesseract.
My armchair idea - and I would be willing to help where I can - is therefore that
- LepTess gets an overload for set_image_* that accepts a Pix
- the plumbing/wrapper/glue should expose PixA (maybe even as an iterator, but even just accessing the count and the entries first, like a read-only implementation to reduce the work required?)
- for this particular use case (which I argue is common for OCR though?), being able to directly read a multi-page TIFF into a PixA (either from file or from memory/a buffer) would be cool, like the existing pix_read and pix_read_mem
In this case there would be no need for another crate and it would probably avoid re-reading (and potentially copying) the image(s) around?
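To make the shape of that idea concrete, here is a rough sketch using plain stand-in types. Everything in it is hypothetical: Pix, PixA, Ocr, set_image_from_pix and recognize_all are placeholders invented for illustration, not the real leptess or leptonica-plumbing API, and no actual TIFF decoding or recognition happens. It only shows a read-only PixA (count, indexed access, iterator) feeding pages one by one into a setter that accepts a Pix.

```rust
// Hypothetical stand-ins only: none of these types or methods are the real
// leptess / leptonica-plumbing API. They model the proposed shape, nothing more.

/// Stand-in for one decoded page (a Leptonica Pix in the real library).
#[derive(Debug, Clone, PartialEq)]
struct Pix {
    label: String,
}

/// Stand-in for a read-only array of pages, as a multi-page TIFF reader
/// might return it.
struct PixA {
    pages: Vec<Pix>,
}

impl PixA {
    /// Number of pages in the array.
    fn count(&self) -> usize {
        self.pages.len()
    }

    /// Borrow a page by index; returns None when the index is out of range.
    fn get(&self, index: usize) -> Option<&Pix> {
        self.pages.get(index)
    }

    /// Iterate over the pages by reference, without copying them.
    fn iter(&self) -> std::slice::Iter<'_, Pix> {
        self.pages.iter()
    }
}

/// Stand-in for the high-level OCR handle.
struct Ocr {
    current: Option<String>,
}

impl Ocr {
    /// The proposed overload: accept an already decoded Pix.
    fn set_image_from_pix(&mut self, pix: &Pix) {
        self.current = Some(pix.label.clone());
    }

    /// Pretend recognition; a real implementation would call Tesseract here.
    fn get_text(&self) -> String {
        match &self.current {
            Some(label) => format!("text of {}", label),
            None => String::new(),
        }
    }
}

/// Run the (pretend) recognition on every page and collect the results.
fn recognize_all(ocr: &mut Ocr, pages: &PixA) -> Vec<String> {
    pages
        .iter()
        .map(|pix| {
            ocr.set_image_from_pix(pix);
            ocr.get_text()
        })
        .collect()
}

fn main() {
    let pages = PixA {
        pages: vec![
            Pix { label: "page 1".to_string() },
            Pix { label: "page 2".to_string() },
        ],
    };
    let mut ocr = Ocr { current: None };

    println!("pages: {}, first: {:?}", pages.count(), pages.get(0));
    for (i, text) in recognize_all(&mut ocr, &pages).iter().enumerate() {
        println!("{}: {}", i, text);
    }
}
```

Handing out borrowed pages keeps the sketch copy-free on the Rust side; whether a real wrapper could do the same depends on how Leptonica's reference counting is exposed, which this sketch does not address.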
Hi,
I haven't forgotten about this.
I'm going to try and get to this step tonight:
the plumbing/wrapper/glue should expose PixA (maybe even as an iterator, but even just accessing the count and the entries first, like a read-only implementation to reduce the work required?)
You may be interested in this PR. GitHub won't let me assign you as a reviewer.
https://github.com/ccouzens/leptonica-plumbing/pull/2