hocr-text: sort by bbox
broken hocr-editors can mess up the order of words and paragraphs. this is an attempt to restore the correct order.
example use
hocr-text --sort-words -f src.hocr >dst.txt
broken hocr-editors
offtopic: are there any good hocr-editors? similar to gImageReader
or maybe hocr2svg | inkscape | svg2hocr
There are a few more hOCR editors like https://github.com/not-implemented/hocr-proofreader and https://github.com/kba/hocrjs
As for the MR - thanks for the MR.
Please separate out the changes in the MR. The change to hocr-text to require the file argument can be merged now without issue, but I am a bit less fond of the shebang changes. Technically I think the codebase should still work with python2 (or it has for a long time at least), and most systems that I know of have 'python' invoke 'python3' - is this not the case on your system?
The actual change, reordering the words based on their position, will I think require some additional research/testing and should probably be optional (with a command line argument to hocr-text and an argument to hocr_page_text), because it might break some workflows.
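To make the "optional" idea concrete, here is a minimal sketch, assuming words are available as (text, bbox) pairs with bbox = (x1, y1, x2, y2). The names `Word`, `words_to_text` and the `sort_words` parameter are hypothetical and only for illustration; this is not the actual signature of hocr_page_text.

```python
from typing import NamedTuple, Tuple


class Word(NamedTuple):
    # Hypothetical container; the real parser may represent words differently.
    text: str
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def words_to_text(words, sort_words=False):
    """Join word texts, optionally reordering them by position first.

    With sort_words=False the document order is kept unchanged, so
    existing workflows are not affected.
    """
    if sort_words:
        # Naive reading order: top edge first, then left edge.
        # Assumption: words on one line share the same y1.
        words = sorted(words, key=lambda w: (w.bbox[1], w.bbox[0]))
    return " ".join(w.text for w in words)


words = [
    Word("world", (60, 10, 100, 20)),
    Word("hello", (10, 10, 50, 20)),
    Word("again", (10, 40, 50, 50)),
]
print(words_to_text(words))                   # document order
print(words_to_text(words, sort_words=True))  # position order
```

Keeping the default at `False` means the sorting only happens when asked for. Real pages would likely need some tolerance when grouping words into lines, since words on the same line rarely have identical top edges.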
fixed
Apologies for the delay in reviewing, I've had some other things going on.
Python has infinity encoded in math.inf, it might make more sense to use that value rather than an arbitrarily high number. In general, for getting the bounding box, I wonder if min/max functions can be used on the values rather than the if statements? For example: x1 = min(x1, [dynamic for loop value]) - it might be tidier with a list comprehension as well.
Let me know if you'd like me to make these changes instead.
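Roughly what I have in mind, as a sketch only: the function names and the (x1, y1, x2, y2) tuple layout below are assumptions for illustration, not the code in this MR.

```python
import math


def union_bbox(boxes):
    """Return the smallest (x1, y1, x2, y2) box enclosing all boxes."""
    # Start from +/- infinity instead of an arbitrarily high number.
    x1 = y1 = math.inf
    x2 = y2 = -math.inf
    for bx1, by1, bx2, by2 in boxes:
        x1 = min(x1, bx1)
        y1 = min(y1, by1)
        x2 = max(x2, bx2)
        y2 = max(y2, by2)
    return x1, y1, x2, y2


def union_bbox_comprehension(boxes):
    """Same result using generator expressions.

    Note: raises ValueError on an empty input, because min()/max()
    have nothing to compare.
    """
    boxes = list(boxes)
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


boxes = [(10, 20, 50, 40), (5, 25, 30, 60)]
print(union_bbox(boxes))
print(union_bbox_comprehension(boxes))
```

The loop version returns (inf, inf, -inf, -inf) for an empty input, so callers would need to decide how to treat a line or paragraph with no words.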
I wonder if min/max functions can be used on the values rather than the if statements?
in #26 i used the if pattern again. as a micro-optimization, it is faster because it avoids the function calls to min/max