gcv2hocr
gcv2hocr2.py output - Correct bbox for individual words, but "ocr_line" bboxes completely busted
I had to manually specify page_width and page_height to match my PDF images to get the words to align. I checked the coordinates of each word by hand and I am sure the words are aligned correctly, but each ocr_line has a bbox that seems to copy the bbox of the last word of the previous sentence, like so:
#sentence 1
<span class='ocrx_word' id='word_2_1_50' title='bbox 658 495 664 518'>”</span>
<span class='ocrx_word' id='word_2_1_51' title='bbox 675 495 691 518'>of</span>
<span class='ocrx_word' id='word_2_1_52' title='bbox 698 495 785 518'>Leninism</span>
**<span class='ocrx_word' id='word_2_1_53' title='bbox 789 495 791 518'>,</span>**
</span>
#sentence 2
**<span class='ocr_line' id='line_2_1_4' title='bbox 789 495 791 518; baseline 0 0'>**
<span class='ocrx_word' id='word_2_1_54' title='bbox 120 522 172 548'>social</span>
<span class='ocrx_word' id='word_2_1_55' title='bbox 183 522 283 548'>democracy</span>
<span class='ocrx_word' id='word_2_1_56' title='bbox 285 522 289 548'>,</span>
<span class='ocrx_word' id='word_2_1_57' title='bbox 297 522 316 548'>or</span>
I haven't been able to figure out the significance of "baseline". Should I be tweaking those values to get correct lines?
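For comparison, I would expect a line's bbox to be the union of the bboxes of the words it contains. Here is a minimal sketch of that computation, using the four words shown for sentence 2 above. The helper name `union_bbox` is made up for illustration and is not part of gcv2hocr, and the real line presumably continues past "or", so the right edge would be larger in practice.

```python
import re

# Matches the four integers in an hOCR title such as "bbox 120 522 172 548".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def union_bbox(word_titles):
    """Return (x0, y0, x1, y1) enclosing every word bbox in word_titles.

    Hypothetical helper for illustration only; not a gcv2hocr function.
    """
    boxes = [tuple(int(n) for n in BBOX_RE.search(t).groups())
             for t in word_titles]
    x0 = min(b[0] for b in boxes)  # leftmost edge of any word
    y0 = min(b[1] for b in boxes)  # topmost edge
    x1 = max(b[2] for b in boxes)  # rightmost edge
    y1 = max(b[3] for b in boxes)  # bottommost edge
    return (x0, y0, x1, y1)


# Title attributes of word_2_1_54 .. word_2_1_57 from the snippet above.
words = [
    "bbox 120 522 172 548",
    "bbox 183 522 283 548",
    "bbox 285 522 289 548",
    "bbox 297 522 316 548",
]

line_bbox = union_bbox(words)
print("bbox %d %d %d %d" % line_bbox)  # prints: bbox 120 522 316 548
```

So line_2_1_4 should start at roughly `bbox 120 522 ...` rather than `bbox 789 495 791 518`. As far as I can tell from the hOCR spec, "baseline" only gives the slope and offset of the text baseline relative to the line's bbox, so I doubt tweaking it would fix the line coordinates.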
Hey @hengyu95, quick question: is this bug still there in gcv2hocr2.py? If not, can you share a code outline or a gist of your edited script? I have updated my own copy to incorporate many improvements, and I am interested in yours too. Share it here so I can improve mine. :)