Support tolerance for horizontal whitespace
If I have some text formatted as follows:
something blah 20
other thing 15
it foo 7
What find_text_lines often seems to do is give me lines like:
something
other
it
blah
thing
foo
20
15
7
Ideally I would be able to specify a tolerance so that words separated only by horizontal whitespace are still considered part of the same line.
Apologies if I'm missing something obvious that already exists.
Roughly speaking, the layout algorithm assumes that your image is arranged into a hierarchy of lines -> paragraphs -> columns and tries to arrange words in reading order. There is currently no configuration option to tell layout analysis to assume a single column, or to specify a minimum spacing between columns (which you could set to infinity to force single-column output). It would make sense to have that option. The function that finds column separators is find_block_separators.
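To make the "minimum spacing between columns" idea concrete, here is a minimal sketch. It is not how find_block_separators works and it is not an existing option. The function name split_at_wide_gaps, the min_column_gap parameter and the (left, right) tuple representation are all made up for illustration. It only looks at the words of a single row, already sorted left to right, and starts a new column whenever the gap to the previous word is at least the threshold.

```rust
/// Hypothetical helper, not part of the library. Each span is the
/// (left, right) x-extent of one word; spans must be sorted left to right.
/// A new column starts wherever the gap to the previous word is at least
/// `min_column_gap`.
fn split_at_wide_gaps(spans: &[(f32, f32)], min_column_gap: f32) -> Vec<Vec<(f32, f32)>> {
    let mut columns: Vec<Vec<(f32, f32)>> = Vec::new();
    for &span in spans {
        // Compare against the right edge of the most recently placed word.
        let starts_new_column = match columns.last().and_then(|c| c.last()) {
            Some(&(_, prev_right)) => span.0 - prev_right >= min_column_gap,
            None => true, // the first word always opens the first column
        };
        if starts_new_column {
            columns.push(vec![span]);
        } else {
            columns.last_mut().unwrap().push(span);
        }
    }
    columns
}
```

Passing f32::INFINITY as min_column_gap means no finite gap ever qualifies, so every word in the row stays in one column. That is the "set it to infinity" behaviour described above.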
Is it possible to write a custom layout algorithm reusing the baseline features? It seems most of the internals aren't exposed.
find_text_lines is a function that takes a &[RotatedRect] of unsorted word rects and returns a Vec<Vec<RotatedRect>> of paragraphs subdivided into lines. You can replace that with any function you like that groups rects. The simplest implementation would be to copy and paste the group_into_lines function into your project (along with the functions it uses from geom_util.rs) and pass it an empty separators argument.
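As a rough illustration of "any function you like that groups rects", here is a minimal sketch of a grouping function that ignores horizontal gaps entirely. It is not group_into_lines and it does not use the library's types. WordBox is a hypothetical axis-aligned stand-in for RotatedRect so the example has no dependencies, and group_by_vertical_overlap is a made-up name. It assumes the text is not rotated and that lines do not overlap vertically.

```rust
/// Hypothetical stand-in for a word bounding box. The real code works with
/// RotatedRect; an axis-aligned box keeps this sketch dependency-free.
#[derive(Clone, Copy, Debug, PartialEq)]
struct WordBox {
    left: f32,
    top: f32,
    right: f32,
    bottom: f32,
}

impl WordBox {
    fn center_y(&self) -> f32 {
        (self.top + self.bottom) / 2.0
    }
}

/// Group words into lines using vertical position only, so a horizontal gap
/// of any width never splits a line.
fn group_by_vertical_overlap(words: &[WordBox]) -> Vec<Vec<WordBox>> {
    // Visit words from top to bottom.
    let mut sorted = words.to_vec();
    sorted.sort_by(|a, b| a.center_y().total_cmp(&b.center_y()));

    let mut lines: Vec<Vec<WordBox>> = Vec::new();
    for word in sorted {
        // A word joins the current line if its vertical center lies within
        // the vertical extent of that line's first word.
        let joins = lines
            .last()
            .and_then(|line| line.first())
            .map_or(false, |first| {
                word.center_y() >= first.top && word.center_y() <= first.bottom
            });
        if joins {
            lines.last_mut().unwrap().push(word);
        } else {
            lines.push(vec![word]);
        }
    }

    // Put each line's words in left-to-right reading order.
    for line in &mut lines {
        line.sort_by(|a, b| a.left.total_cmp(&b.left));
    }
    lines
}
```

For the three-row example at the top of this thread, this would group "something", "blah" and "20" into one line regardless of how far apart they are. Skewed or curved receipts would need something closer to what group_into_lines does with rotated rects.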
I appreciate the info. Giving that a try results in a fair amount of content just being missed. I'm guessing it doesn't like having so much white space but I'm not sure.
Can you upload a couple of sample images of the text you are trying to recognize?
It's a few random receipts off the internet:
I'm just trying to get the text of each receipt line by line. For example, all this text should be together:
I appreciate you taking a look. Thanks for your time :)