leptess Multi-page support (TIFF)

Hey.

Most OCR work I've seen so far uses multi-page documents (b/w, CCITT-compressed). I'd like to make these work with leptess, but it seems (unless I'm missing something?) that there's only support for Pix, not PixA, and no mapping for direct multi-page TIFF I/O (say, pixaReadMultipageTiff from Leptonica). The high-level wrapper (leptess::LepTess) also doesn't expose a method to set_image a Pix directly, but that would be the most trivial thing to change.

In other words: I was hoping for a Rust (leptess) workflow that allows:

- reading a multi-page TIFF as PixA
- iterating over each page -> Pix and collecting the recognition results
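
As a rough sketch of the shape I have in mind (none of this is real leptess API: `ocr_pages` and the `recognize` closure are names I made up, and the page type is left generic so it could be a Pix or anything else):

```rust
/// Runs a recognition callback over every page and collects the text.
///
/// HYPOTHETICAL helper: `pages` stands in for whatever a PixA binding
/// would yield, and `recognize` stands in for "set the image, then get
/// the text" on the OCR engine.
pub fn ocr_pages<P, I, F>(pages: I, mut recognize: F) -> Result<Vec<String>, String>
where
    I: IntoIterator<Item = P>,
    F: FnMut(&P) -> Result<String, String>,
{
    let mut results = Vec::new();
    for (index, page) in pages.into_iter().enumerate() {
        // Stop at the first failing page and report which one it was.
        let text = recognize(&page).map_err(|e| format!("page {}: {}", index, e))?;
        results.push(text);
    }
    Ok(results)
}
```

The point is only that the caller sees one result per page, in page order, and that a failure names the page it happened on.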

Is that something you'd be willing to support? Am I missing a way to do this today already? I could offer to look into this, but I admit that I'm a Rust beginner at this point.

May 08 '22 08:05 darklajid

Hey, I might be able to look at this, but it wouldn't be until next weekend.

I think this might be possible today using set_image_from_mem and the image crate, but I haven't tried it.

Some notes for myself: https://tpgit.github.io/Leptonica/pix_8h_source.html#l00363 https://github.com/DanBloomberg/leptonica/blob/5aaf1c187deeef7f47288c6b0833a07021940da7/src/tiffiostub.c#L99-L103

May 08 '22 11:05 ccouzens

Thanks a ton for the reply. Looking at the linked image crate and into_bytes: it probably should NOT copy for this to be a decent workaround? Otherwise my naive understanding is that the image would be read once, then copied for each page (and then re-read by Leptonica anyway).

Leptonica does provide the required functionality already, right? PixA is a collection of Pix (an "A"rray of Pix) that allows access to the individual entries. Those could be passed to tess_api.set_image directly, if that were exposed in the high-level LepTess. This is already what happens in set_image_from_mem anyway: it reads a buffer into a Pix, then hands that to Tesseract.

My armchair idea - and I would be willing to help where I can - is therefore that:

- LepTess gets an overload for set_image_* that accepts a Pix
- the plumbing/wrapper/glue should expose PixA (maybe even as an iterator, but even just accessing the count and the entries first, like a read-only implementation to reduce the work required?)
- for this particular use case (which I argue is common for OCR though?), being able to directly read a multi-page TIFF into a PixA (either from file or from memory/a buffer) would be cool, like the existing pix_read and pix_read_mem
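
To make the "count and entries first, iterator later" idea a bit more concrete, here is a minimal sketch. The `PixArray` trait and `PixArrayIter` type are invented for illustration and are not what leptonica-plumbing exposes; the two trait methods are meant to mirror Leptonica's pixaGetCount and pixaGetPix, and the in-memory `VecPixArray` only exists so the example runs without Leptonica.

```rust
/// HYPOTHETICAL read-only view of a PixA: just a count and indexed access.
pub trait PixArray {
    type Pix;
    /// Number of entries (think pixaGetCount).
    fn count(&self) -> usize;
    /// Entry at `index`, or None if out of range (think pixaGetPix).
    fn get(&self, index: usize) -> Option<Self::Pix>;
}

/// Iterator layered on top of the two read-only accessors.
pub struct PixArrayIter<'a, A: PixArray> {
    array: &'a A,
    next: usize,
}

impl<'a, A: PixArray> Iterator for PixArrayIter<'a, A> {
    type Item = A::Pix;

    fn next(&mut self) -> Option<A::Pix> {
        if self.next >= self.array.count() {
            return None;
        }
        let pix = self.array.get(self.next);
        self.next += 1;
        pix
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let remaining = self.array.count().saturating_sub(self.next);
        (0, Some(remaining))
    }
}

/// Starts iterating over the pages of any read-only array.
pub fn pages<A: PixArray>(array: &A) -> PixArrayIter<'_, A> {
    PixArrayIter { array, next: 0 }
}

/// In-memory stand-in so the sketch runs without Leptonica.
pub struct VecPixArray(pub Vec<String>);

impl PixArray for VecPixArray {
    type Pix = String;
    fn count(&self) -> usize {
        self.0.len()
    }
    fn get(&self, index: usize) -> Option<String> {
        self.0.get(index).cloned()
    }
}
```

With only the count and get accessors wrapped, the iterator comes almost for free, so the read-only version would not be wasted work.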

In this case there would be no need for another crate, and it would probably avoid re-reading (and potentially copying) the image(s)?

May 09 '22 05:05 darklajid

Hi,

I haven't forgotten about this.

I'm going to try and get to this step tonight:

"the plumbing/wrapper/glue should expose PixA (maybe even as an iterator, but even just accessing the count and the entries first, like a read-only implementation to reduce the work required?)"

May 16 '22 07:05 ccouzens

You may be interested in this PR. GitHub won't let me assign you as a reviewer.

https://github.com/ccouzens/leptonica-plumbing/pull/2

May 16 '22 22:05 ccouzens
