Soft hyphen HTML entity in hOCR

Open tukusejssirs opened this issue 4 years ago • 0 comments

Related: #94


The hOCR Living Standard suggests using the &shy; HTML entity. However, when I insert &shy; in a box in hOCR within gImageReader’s output pane, the entity in the source is changed, which is not correct.

IMHO the (soft) hyphens at the end of a line should be automatically converted to &shy; in hOCR and to regular hyphens in the text output. As far as I remember, tesseract outputs these hyphens as soft hyphens (U+00AD); see tesseract-ocr/tesseract#2161 (esp. this comment of mine), which is a duplicate of tesseract-ocr/tesseract#728. Note that back then I considered soft hyphens a bad thing, but I no longer do.
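To make the proposed behaviour concrete, here is a minimal Python sketch of the two conversions. It is not gImageReader code: the function names `hocr_word_text` and `plain_word_text` are hypothetical, and it assumes the recognized word arrives as a string that may contain U+00AD.

```python
from html import escape

SOFT_HYPHEN = "\u00ad"  # U+00AD, the character tesseract is said to emit


def hocr_word_text(word: str) -> str:
    """Text for the hOCR output: escape markup characters, then write
    every soft hyphen as the &shy; entity (hypothetical helper)."""
    return escape(word, quote=False).replace(SOFT_HYPHEN, "&shy;")


def plain_word_text(word: str) -> str:
    """Text for the plain-text output: a soft hyphen at the end of the
    word (i.e. at the line break) becomes a regular hyphen."""
    if word.endswith(SOFT_HYPHEN):
        return word[:-1] + "-"
    return word
```

Escaping happens before the soft hyphen is replaced, so the ampersand of the inserted &shy; is not itself escaped.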

For now, I need to replace the hyphens at the end of the lines with &shy; using a text editor or sed:

  1. text editor (using regex):
find: -(</span>\n\s*</span>)
replace: &shy;$1  (or &shy;\1, depending on the editor)
  2. sed:
sed -zi 's|-\(</span>\n\s*</span>\)|\&shy;\1|g' <filename>
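The same replacement can be sketched in Python for anyone who prefers a script over sed. This is only an illustration: the function name `soften_line_end_hyphens` is hypothetical, and it assumes the layout the regex above relies on, namely that the word's closing </span> is followed by a newline, optional whitespace, and the line's closing </span>.

```python
import re

# A hyphen directly before the word's closing </span>, which is followed
# by a newline, optional whitespace, and the line's closing </span>.
LINE_END_HYPHEN = re.compile(r"-(</span>\n\s*</span>)")


def soften_line_end_hyphens(hocr: str) -> str:
    """Replace line-final hyphens with &shy;, keeping the closing tags
    and the whitespace between them exactly as they were."""
    return LINE_END_HYPHEN.sub(r"&shy;\1", hocr)
```

The capture group puts the matched tags and whitespace back unchanged. Writing \n\s* in the replacement would not do that, because \s is not expanded on the replacement side.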

Update: When I replace those hyphens with &shy; outside of gImageReader and then try to load the file in gImageReader, the program crashes (core dumped). Therefore there is some other issue too.

Update 2: The core dump will be uploaded to Google Drive. Please be patient, as my Internet connection is neither fast nor stable in my location.

tukusejssirs, Jan 04 '21 11:01