ocrs

Support tolerance for horizontal whitespace

Open WolfeCub opened this issue 6 months ago • 6 comments

If I have some text formatted as follows:

something     blah               20
other         thing              15
it            foo                 7

What find_text_lines often seems to do is give me lines like:

something
other
it
blah
thing
foo
20
15
7

Ideally I would be able to specify a tolerance for how much horizontal whitespace can separate words that are still considered part of the same line.

Apologies if I'm missing something obvious that already exists.

WolfeCub, Jul 07 '25 02:07

Roughly speaking the layout algorithm assumes that your image is arranged into a hierarchy of lines -> paragraphs -> columns and tries to arrange words in reading order. There is currently no configuration option to tell layout analysis to assume a single column or specify a minimum spacing between columns (which you could set to infinite to force a single column output). It would make sense to have that option. The function that finds column separators is find_block_separators.
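To illustrate what a "minimum spacing between columns" option could mean, here is a minimal sketch. It is not the code in find_block_separators; `WordExtent`, `column_gaps` and `min_gap` are hypothetical names, and words are reduced to their horizontal extents. It projects all words onto the x-axis and reports empty gaps at least `min_gap` wide.

```rust
/// Horizontal extent of one word (hypothetical stand-in for a word rect).
#[derive(Clone, Copy, Debug, PartialEq)]
struct WordExtent {
    left: f32,
    right: f32,
}

/// Return the x-ranges that no word covers and that are at least `min_gap` wide.
/// Each range is a candidate column separator.
fn column_gaps(words: &[WordExtent], min_gap: f32) -> Vec<(f32, f32)> {
    // Sort the horizontal extents by their left edge.
    let mut spans: Vec<(f32, f32)> = words.iter().map(|w| (w.left, w.right)).collect();
    spans.sort_by(|a, b| a.0.total_cmp(&b.0));

    let mut gaps = Vec::new();
    // `covered_to` is the right-most x position covered by any word seen so far.
    let mut covered_to = match spans.first() {
        Some(span) => span.1,
        None => return gaps,
    };
    for &(left, right) in &spans[1..] {
        if left - covered_to >= min_gap {
            gaps.push((covered_to, left));
        }
        covered_to = covered_to.max(right);
    }
    gaps
}
```

Under these assumptions, passing `f32::INFINITY` as `min_gap` yields no separators for any finite input, which corresponds to forcing a single column.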

robertknight, Jul 07 '25 06:07

Is it possible to write a custom layout algorithm reusing the baseline features? It seems most of the internals aren't exposed.

WolfeCub, Jul 07 '25 18:07

find_text_lines is a function that takes a &[RotatedRect] of un-sorted word rects and returns a Vec<Vec<RotatedRect>> of paragraphs sub-divided into lines. You can replace that with any function you like that groups rects. The simplest implementation would be to copy and paste the group_into_lines function into your project (along with the functions it uses from geom_util.rs) and pass it an empty separators argument.
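As a rough sketch of such a replacement, the function below has the same input and output shape but groups words into rows by vertical position only, so horizontal whitespace never splits a line. It is not a copy of group_into_lines. `WordBox` is a hypothetical axis-aligned stand-in for RotatedRect, and the sketch assumes the text is not skewed.

```rust
/// Simplified axis-aligned word box (hypothetical type; real code would
/// read these extents from the word's RotatedRect).
#[derive(Clone, Copy, Debug, PartialEq)]
struct WordBox {
    left: f32,
    top: f32,
    right: f32,
    bottom: f32,
}

impl WordBox {
    fn center_y(&self) -> f32 {
        (self.top + self.bottom) / 2.0
    }
}

/// Group unsorted words into rows, ignoring horizontal gaps entirely.
/// Each inner Vec is one row, ordered left to right.
fn group_into_rows(words: &[WordBox]) -> Vec<Vec<WordBox>> {
    // Visit words from top to bottom.
    let mut sorted = words.to_vec();
    sorted.sort_by(|a, b| a.center_y().total_cmp(&b.center_y()));

    let mut rows: Vec<Vec<WordBox>> = Vec::new();
    for word in sorted {
        // A word joins the current row if its vertical center falls inside
        // the vertical extent of the first word added to that row.
        let joins_last = rows.last().map_or(false, |row| {
            let first = row[0];
            word.center_y() >= first.top && word.center_y() <= first.bottom
        });
        if joins_last {
            rows.last_mut().unwrap().push(word);
        } else {
            rows.push(vec![word]);
        }
    }

    // Put the words of each row in reading order.
    for row in &mut rows {
        row.sort_by(|a, b| a.left.total_cmp(&b.left));
    }
    rows
}
```

Given the nine word boxes from the example at the top of this issue, in any order, this should return three rows of three words each. Rows that are tilted or that overlap vertically would need a more careful test than the center check used here.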

robertknight, Jul 07 '25 18:07

I appreciate the info. Giving that a try results in a fair amount of content just being missed. I'm guessing it doesn't like having so much white space but I'm not sure.

WolfeCub, Jul 07 '25 19:07

Can you upload a couple of sample images of the text you are trying to recognize?

robertknight, Jul 07 '25 20:07

It's a few random receipts off the internet:

Image Image

I'm just trying to get the text of the receipt one line at a time. For example, all of this text should be together:

Image

I appreciate you taking a look. Thanks for your time :)

WolfeCub, Jul 09 '25 14:07