gImageReader
Soft hyphen HTML entity in hOCR
Related: #94
hOCR Living Standard suggests using the &shy; HTML entity. However, when I insert &shy;
in a box in hOCR within gImageReader’s output pane, the source is changed to ­
, which is not correct.
IMHO the (soft) hyphens at the end of a line should be automatically converted to &shy; in hOCR and to regular hyphens in the text output. As far as I remember, tesseract outputs these hyphens as soft hyphens (U+00AD); see tesseract-ocr/tesseract#2161 (esp. this comment of mine), which is a duplicate of tesseract-ocr/tesseract#728. Note that back then I considered soft hyphens a bad thing, but not anymore.
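The text-output half of this suggestion could look roughly like the sketch below. It is only an illustration in Python, not gImageReader code; the function name `eol_soft_to_hard` is hypothetical, and it assumes the plain text really contains U+00AD at line ends.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, the character tesseract is assumed to emit


def eol_soft_to_hard(text: str) -> str:
    """Turn a soft hyphen at the end of a line into a regular hyphen.

    Soft hyphens elsewhere in a line are left untouched.
    """
    return "\n".join(
        line[:-1] + "-" if line.endswith(SOFT_HYPHEN) else line
        for line in text.split("\n")
    )
```

Only line-final soft hyphens are converted, since those are the ones that mark a word broken across lines.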
For now, I need to replace the hyphens at the end of the lines with &shy; using a text editor or sed:
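- text editor (using regex, with a capture group so the line break and indentation are kept as they are):
find: -(</span>\n\s*</span>)
replace: &shy;\1 (or &shy;$1, depending on the editor)
- sed (GNU sed, because of -z and \s):
sed -zi 's|-\(</span>\n\s*</span>\)|\&shy;\1|g' <filename>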
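For systems without GNU sed, the same workaround can be sketched in Python. This is a minimal sketch, assuming the hOCR layout implied by the regex above (a word span ending in a hyphen, followed by a newline and the closing tag of the line span); the function name `soften_eol_hyphens` is hypothetical.

```python
import re

# A hyphen right before the closing tag of the last word span in a line span.
# The tail (closing tags, newline, indentation) is captured so it is kept as-is.
EOL_HYPHEN = re.compile(r"-(</span>\n\s*</span>)")


def soften_eol_hyphens(hocr: str) -> str:
    """Replace end-of-line hyphens with the &shy; entity in hOCR markup."""
    return EOL_HYPHEN.sub(r"&shy;\1", hocr)


if __name__ == "__main__":
    sample = (
        "<span class='ocr_line'>\n"
        "  <span class='ocrx_word'>well-known</span>\n"
        "  <span class='ocrx_word'>exam-</span>\n"
        "</span>\n"
    )
    print(soften_eol_hyphens(sample))
```

Hyphens inside a word (such as "well-known") are not affected, because the pattern only matches a hyphen that is immediately followed by the two closing tags.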
Update: When I replace those hyphens with &shy; outside of gImageReader and then try to load the file in gImageReader, the program crashes (core dumped). Therefore there is some other issue too.
Update 2: The core dump will be uploaded to Google Drive. Please be patient, as my Internet connection is not very stable at my location.