autofix tesseract OCR output of a scanned book with the expected text from an EPUB file of the same book

Open milahu opened this issue 2 months ago • 0 comments

i have two versions of the same book

an EPUB version
an HOCR version created by tesseract from scanned images (TIFF files), which i want to convert to a searchable PDF file (page images with a transparent text layer)
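
the expected text first has to be pulled out of the EPUB. a minimal sketch, assuming the EPUB is an ordinary zip archive with XHTML chapter files and that sorting the file names gives the reading order (a real EPUB defines the order in its OPF spine, which this sketch ignores). the names `epub_text` and `_TextExtractor` are hypothetical:

```python
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <title>, <script> and <style>."""

    SKIPPED = ("title", "script", "style")

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def epub_text(path_or_file):
    """Return one whitespace-normalized line of text per chapter file."""
    chunks = []
    with zipfile.ZipFile(path_or_file) as archive:
        # assumption: file names sort in reading order
        for name in sorted(archive.namelist()):
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            parser = _TextExtractor()
            parser.feed(archive.read(name).decode("utf-8", errors="replace"))
            parser.close()
            text = " ".join("".join(parser.parts).split())
            if text:
                chunks.append(text)
    return "\n".join(chunks)
```

usage would be something like `expected = epub_text("book.epub")`, where `book.epub` is a placeholder path.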

problem: tesseract makes many mistakes when recognizing text

bad solution: manually proofread the HOCR files

wanted solution: automatically fix the almost-correct text in the HOCR files using the correct text in the EPUB file. aka: automatic proofreading of HOCR files with a known expected text
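
a minimal sketch of that idea in python, assuming the HOCR marks every word as an element with class `ocrx_word` and that the expected text of the page is already available as a plain string. the function name `fix_hocr_words` is hypothetical, and `difflib` is used here only as a simple stdlib aligner, not as a proven solution. only 1:1 word replacements are applied; splits, merges, insertions and deletions are left untouched for manual review:

```python
import difflib
import xml.etree.ElementTree as ET


def fix_hocr_words(hocr_xml, expected_text):
    """Replace misrecognized words in an hOCR string, keeping all bbox attributes."""
    root = ET.fromstring(hocr_xml)
    # assumption: each word is an element with class="ocrx_word" and plain text content
    spans = [el for el in root.iter() if el.get("class") == "ocrx_word"]
    ocr_words = [el.text or "" for el in spans]
    expected_words = expected_text.split()

    matcher = difflib.SequenceMatcher(None, ocr_words, expected_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # only touch runs where the word counts agree on both sides
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for el, word in zip(spans[i1:i2], expected_words[j1:j2]):
                el.text = word
    return ET.tostring(root, encoding="unicode")


sample = (
    "<div>"
    "<span class='ocrx_word' title='bbox 0 0 10 10'>Tbe</span> "
    "<span class='ocrx_word' title='bbox 12 0 30 10'>quick</span> "
    "<span class='ocrx_word' title='bbox 32 0 50 10'>brovvn</span> "
    "<span class='ocrx_word' title='bbox 52 0 70 10'>fox</span>"
    "</div>"
)
print(fix_hocr_words(sample, "The quick brown fox"))
```

real HOCR files are XHTML with a namespace, so `ET.register_namespace("", "http://www.w3.org/1999/xhtml")` would probably be needed before serializing to avoid `ns0:` prefixes, and words wrapped in nested tags such as `<strong>` are not handled by this sketch.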

this would also require alignment of similar texts (sequence alignment), a problem which i have already encountered (and somewhat solved) in my translate-richtext project, where i use a character-diff to align two similar texts:

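# character-level diff of the two files: drop the added (green) text,
# strip the remaining color codes, then skip the 5-line diff header.
# sed has no non-greedy .*? so match up to the next escape character instead
git diff --word-diff=color --word-diff-regex=. --no-index \
  "$(readlink -f translation.joined.txt)" \
  "$(readlink -f translation.splitted.txt)" |
sed -E $'s/\e\[32m[^\e]*\e\[m//g; s/\e\\[[0-9;:]*[a-zA-Z]//g' |
tail -n +6 >translation.aligned.txt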
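
a similar character-level alignment can be sketched in python with `difflib` from the standard library instead of `git diff`. this is only an illustration under that substitution, and the function name `align_chars` is hypothetical. it returns the aligned chunks of both texts together with the kind of difference:

```python
import difflib


def align_chars(ocr_text, expected_text):
    """Return (tag, ocr_chunk, expected_chunk) triples covering both texts in order."""
    matcher = difflib.SequenceMatcher(None, ocr_text, expected_text, autojunk=False)
    return [
        (tag, ocr_text[i1:i2], expected_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


# tag is one of "equal", "replace", "delete", "insert"
for tag, old, new in align_chars("he1lo wor1d", "hello world"):
    print(f"{tag:8} {old!r} -> {new!r}")
```

`autojunk=False` turns off the popularity heuristic of `SequenceMatcher`, which otherwise treats frequent characters as junk in sequences of 200 or more items. for a whole book the quadratic worst case of `difflib` may become slow, so aligning page by page or paragraph by paragraph is probably more practical.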

other possible solutions: passim and text-pair

the alignment of similar texts can produce new mistakes, so it should be easy to manually inspect and fix the alignments (semi-automatic solution)
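
one way to make the alignments inspectable is to write every proposed change to a table and only auto-apply the ones that look safe. a minimal sketch, assuming word lists for the OCR text and the expected text; the names `review_rows` and `to_tsv` and the threshold of 0.6 are arbitrary choices for illustration, not tested values:

```python
import difflib


def review_rows(ocr_words, expected_words, threshold=0.6):
    """List every proposed change with a similarity score and a status."""
    rows = []
    matcher = difflib.SequenceMatcher(None, ocr_words, expected_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old = " ".join(ocr_words[i1:i2])
        new = " ".join(expected_words[j1:j2])
        ratio = difflib.SequenceMatcher(None, old, new, autojunk=False).ratio()
        same_count = (i2 - i1) == (j2 - j1)
        # anything that changes the word count or looks too different needs a human
        status = "auto" if same_count and ratio >= threshold else "review"
        rows.append((i1, old, new, round(ratio, 2), status))
    return rows


def to_tsv(rows):
    """Format the rows as tab-separated text that can be edited by hand."""
    header = "index\tocr\texpected\tratio\tstatus"
    return "\n".join([header] + ["\t".join(str(v) for v in row) for row in rows])


ocr = "Tbe quick zzz fox".split()
expected = "The quick brown fox".split()
print(to_tsv(review_rows(ocr, expected)))
```

the TSV could be corrected by hand (for example by changing a status to `skip`) and then fed back into the script that applies the fixes.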

the solution should be implemented in a python script, to make it easy to customize

crossposted to reddit and stackexchange

Oct 22 '25 08:10 milahu