archive-hocr-tools

hocr-text: sort by bbox

Open milahu opened this issue 5 months ago • 6 comments

broken hocr-editors can mess up the order of words and paragraphs. this is an attempt to restore the correct order of words and paragraphs.

example use

hocr-text --sort-words -f src.hocr >dst.txt
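A minimal sketch of what sorting by bbox could look like, assuming each word is a (text, bbox) pair with bbox = (x0, y0, x1, y1) as in an hOCR title attribute. The helper name sort_words_by_bbox and the line_tolerance threshold are made up for illustration; this is not the code from the MR.

```python
def sort_words_by_bbox(words, line_tolerance=10):
    """Return words in reading order: top to bottom, then left to right.

    words: list of (text, (x0, y0, x1, y1)) tuples.
    line_tolerance: hypothetical threshold in pixels. A word whose top edge
    is within this distance of the first word of the current line is
    treated as belonging to that line.
    """
    lines = []
    # Walk the words from top to bottom (by y0) and group them into lines.
    for word in sorted(words, key=lambda w: w[1][1]):
        if lines and abs(word[1][1] - lines[-1][0][1][1]) < line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order the words from left to right (by x0).
    result = []
    for line in lines:
        result.extend(sorted(line, key=lambda w: w[1][0]))
    return result


words = [
    ("world", (120, 12, 180, 30)),
    ("second", (10, 50, 80, 70)),
    ("hello", (10, 10, 60, 30)),
]
print(" ".join(text for text, _ in sort_words_by_bbox(words)))
# prints "hello world second"
```

Grouping into lines first avoids ordering words purely by y0, which would scramble a line whose words have slightly different top edges. Multi-column pages would need more than this.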

> broken hocr-editors

offtopic: are there any good hocr-editors? similar to gImageReader or maybe hocr2svg | inkscape | svg2hocr

milahu avatar Aug 18 '25 08:08 milahu

There are a few more hOCR editors like https://github.com/not-implemented/hocr-proofreader and https://github.com/kba/hocrjs

MerlijnWajer avatar Aug 18 '25 09:08 MerlijnWajer

As for the MR - thanks for the MR.

Please separate out the changes in the MR. The change to hocr-text to require the file argument can be merged now without issue, but I am a bit less fond of the shebang changes. Technically I think the codebase should still work with python2 (or it has for a long time, at least), and most systems that I know of have 'python' invoke 'python3' - is this not the case on your system?

The actual change, reordering the words based on their position, will I think require some additional research/testing, and it should probably be optional (with a command line argument to hocr-text and an argument to hocr_page_text) because it might break some workflows.
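A rough sketch of the opt-in shape described here, using argparse. The -f and --sort-words flags are taken from the example invocation in this thread; the dest name, the page_text function (a stand-in, not the real hocr_page_text) and its sort_words parameter are assumptions for illustration only.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="hOCR to plain text (sketch)")
    parser.add_argument("-f", dest="infile", required=True,
                        help="input hOCR file")
    # Off by default, so existing workflows keep the document order.
    parser.add_argument("--sort-words", action="store_true",
                        help="reorder words by bbox position")
    return parser


def page_text(words, sort_words=False):
    """Hypothetical stand-in: join (text, bbox) words into one string."""
    if sort_words:
        # Simplistic ordering by (y0, x0); real pages need line grouping.
        words = sorted(words, key=lambda w: (w[1][1], w[1][0]))
    return " ".join(text for text, _ in words)


args = build_parser().parse_args(["-f", "src.hocr", "--sort-words"])
words = [("world", (120, 10, 180, 30)), ("hello", (10, 10, 60, 30))]
print(page_text(words, sort_words=args.sort_words))
# prints "hello world"
```

Without --sort-words the same input comes out as "world hello", i.e. unchanged from the file order.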

MerlijnWajer avatar Aug 18 '25 09:08 MerlijnWajer

fixed

milahu avatar Aug 18 '25 10:08 milahu

Apologies for the delay in reviewing, I've had some other things going on.

Python has infinity encoded in math.inf, so it might make more sense to use that value rather than an arbitrary high number. In general, for getting the bounding box, I wonder if the min/max functions can be used on the values rather than the if statements? For example: x1 = min(x1, [dynamic for loop value]) - it might be tidier with a list comprehension as well.
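To illustrate the suggestion, a small sketch assuming bboxes are (x0, y0, x1, y1) tuples. The function names are hypothetical and this is not the code in the MR.

```python
import math


def union_bbox(boxes):
    """Smallest bbox enclosing all boxes, using math.inf as the start value."""
    x0 = y0 = math.inf
    x1 = y1 = -math.inf
    for bx0, by0, bx1, by1 in boxes:
        x0 = min(x0, bx0)
        y0 = min(y0, by0)
        x1 = max(x1, bx1)
        y1 = max(y1, by1)
    return x0, y0, x1, y1


def union_bbox_zip(boxes):
    """Same result without an explicit loop; requires at least one box."""
    # zip(*boxes) transposes the list into four coordinate columns.
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return min(xs0), min(ys0), max(xs1), max(ys1)


boxes = [(10, 10, 60, 30), (120, 12, 180, 30)]
print(union_bbox(boxes))      # prints (10, 10, 180, 30)
print(union_bbox_zip(boxes))  # prints (10, 10, 180, 30)
```

Note that math.inf only exists in Python 3.5 and later. If python2 compatibility matters, float("inf") behaves the same way and works on both.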

Let me know if you'd like me to make these changes instead.

MerlijnWajer avatar Oct 10 '25 10:10 MerlijnWajer

> if you'd like me to make these changes

yes please, i don't need this any more since hocr-editor-qt

milahu avatar Oct 10 '25 11:10 milahu

> I wonder if min/max functions can be used on the values rather than the if statements?

in #26 i used the if pattern again. as a micro-optimization, this is faster because it avoids function calls to min/max
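One way to check this on a given interpreter is a quick timeit comparison. The sketch below uses synthetic boxes and hypothetical function names; it is not the code from #26, and the timings will vary by machine and Python version.

```python
import timeit


def bbox_if(boxes):
    """Union bbox using plain comparisons (the 'if pattern')."""
    x0 = y0 = float("inf")
    x1 = y1 = float("-inf")
    for bx0, by0, bx1, by1 in boxes:
        if bx0 < x0:
            x0 = bx0
        if by0 < y0:
            y0 = by0
        if bx1 > x1:
            x1 = bx1
        if by1 > y1:
            y1 = by1
    return x0, y0, x1, y1


def bbox_minmax(boxes):
    """Union bbox using min/max calls inside the loop."""
    x0 = y0 = float("inf")
    x1 = y1 = float("-inf")
    for bx0, by0, bx1, by1 in boxes:
        x0 = min(x0, bx0)
        y0 = min(y0, by0)
        x1 = max(x1, bx1)
        y1 = max(y1, by1)
    return x0, y0, x1, y1


# Synthetic (x0, y0, x1, y1) boxes, only for timing purposes.
boxes = [(i, i + 1, i + 50, i + 20) for i in range(1000)]

# Both variants must agree before the timings mean anything.
assert bbox_if(boxes) == bbox_minmax(boxes)

for fn in (bbox_if, bbox_minmax):
    seconds = timeit.timeit(lambda: fn(boxes), number=200)
    print(fn.__name__, round(seconds, 4))
```

Whether the difference matters depends on how many boxes a page has; for a few thousand words per page either variant is likely to be a small part of the total runtime.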

milahu avatar Oct 25 '25 18:10 milahu