gImageReader Soft hyphen HTML entity in hOCR

Open tukusejssirs opened this issue 4 years ago • 0 comments

Related: #94

The hOCR Living Standard suggests using the &shy; HTML entity. However, when I insert &shy; in a box in hOCR within gImageReader’s output pane, the source is changed to &amp;shy;, which is not correct.

IMHO, (soft) hyphens at the end of a line should be automatically converted to &shy; in hOCR and to regular hyphens in the text output. As far as I remember, tesseract outputs these hyphens as soft hyphens (U+00AD); see tesseract-ocr/tesseract#2161 (especially this comment of mine), which is a duplicate of tesseract-ocr/tesseract#728. Note that back then I considered soft hyphens a bad thing, but I no longer do.
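To illustrate the text-output half of this proposal, here is a minimal Python sketch. It is not gImageReader code; the function name hocr_word_to_text is hypothetical, and it assumes each hOCR word is available as a separate string. It decodes HTML entities and turns a trailing soft hyphen (U+00AD) into a regular hyphen:

```python
import html

SOFT_HYPHEN = "\u00ad"  # the character that &shy; decodes to


def hocr_word_to_text(word: str) -> str:
    """Hypothetical helper: convert one hOCR word to plain text.

    Entities such as &shy; are decoded first; a soft hyphen at the very
    end of the word is then written out as a regular hyphen.
    """
    decoded = html.unescape(word)
    if decoded.endswith(SOFT_HYPHEN):
        return decoded[:-1] + "-"
    return decoded
```

Soft hyphens elsewhere in a word are left untouched in this sketch; whether they should be dropped is a separate design decision.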

For now, I need to replace the hyphens at the end of lines with &shy; using a text editor or sed:

text editor (using regex):

find: -(</span>\n\s*</span>)
replace: &shy;$1 (or &shy;\1, depending on the editor)

sed:

sed -zi 's|-\(</span>\n\s*</span>\)|\&shy;\1|g' <filename>
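The same workaround can be sketched in Python for systems without GNU sed (the -z option is a GNU extension). This is only an illustration under the same assumption as the pattern above: a word's closing </span> is followed, on the next line, by the closing </span> of its line element. The function name soften_line_end_hyphens is hypothetical.

```python
import re

# A hyphen directly before a word's closing </span>, followed by a newline,
# optional whitespace and the closing </span> of the enclosing line element.
LINE_END_HYPHEN = re.compile(r"-(</span>\n\s*</span>)")


def soften_line_end_hyphens(hocr: str) -> str:
    """Replace line-end hyphens with the &shy; entity, keeping the markup."""
    # \1 re-inserts the captured closing tags (including the original
    # whitespace) unchanged after the entity.
    return LINE_END_HYPHEN.sub(r"&shy;\1", hocr)
```

Capturing the closing tags and re-inserting them keeps the original indentation intact, since only the hyphen itself is replaced.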

Update: When I replace those hyphens with &shy; outside of gImageReader and then try to load the file in gImageReader, the program crashes (core dumped). Therefore there is some other issue too.

Update 2: The core dump will be uploaded to Google Drive. Please be patient, as my Internet connection is not very stable at my location.

Jan 04 '21 11:01 tukusejssirs
