tesseract lstmeval: Improve output by ensuring 'Truth:' text is encoded the same way as OCR output…

Open nickjwhite opened this issue 3 years ago • 5 comments

This ensures that transformations like Unicode normalisation are applied to the truth output as well as the OCR output, so that the two can be compared properly.

Before this a perfect OCR result could show different lines for Truth and OCR if the OCR output included characters that were normalised.
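The mismatch described above can be reproduced outside Tesseract. The following is only a sketch, not lstmeval's actual code: `normalise` is a hypothetical helper, and NFKC is an assumed stand-in for whatever transformation the OCR output goes through.

```python
import unicodedata


def normalise(text: str, form: str = "NFKC") -> str:
    """Hypothetical helper: apply one Unicode normalisation form to a string.

    NFKC is an assumption made for illustration only.
    """
    return unicodedata.normalize(form, text)


truth = "11 v. 904)² (p. 142)"   # ground truth as written
ocr_out = normalise(truth)       # a "perfect" OCR result, after normalisation

# Comparing the raw truth with the normalised OCR output reports a
# spurious difference, even though recognition was perfect.
print(truth == ocr_out)             # False

# Passing the truth through the same step makes the comparison fair.
print(normalise(truth) == ocr_out)  # True
```

The point is only that both strings must go through the same transformation before they are printed or compared.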

May 11 '21 09:05 nickjwhite

@nickjwhite Please provide a sample demonstrating this.

Before this a perfect OCR result could show different lines for Truth and OCR if the OCR output included characters that were normalised.

I had noticed this in the past but do not have any ready example to test and verify.

Sep 07 '21 16:09 Shreeshrii

Is this issue related?

Sep 07 '21 16:09 Shreeshrii

Is this issue related?

No, this issue looks more like the wrong normalisation form, which normalises long_s to s:

$ perl -e 'use utf8; use Unicode::Normalize; print NFC("ſ"),"\n";'
ſ
$ perl -e 'use utf8; use Unicode::Normalize; print NFKC("ſ"),"\n";'
s

Sep 07 '21 17:09 wollmers

Ok, I have a sample now.

eng Praja exp0_159

Ground Truth: aṇṇi- aṇṇi- , 11 v. 904)² (p. 142) alakkaḻi- ... in the Coimbatore

OCR via CLI using custom IAST traineddata: aṇṇi- aṇṇi- , 11 v. 904)² (p. 142) alakkaḻi- ... in the Coimbatore

OCR via lstmeval using same custom IAST traineddata: aṇṇi- aṇṇi- , 11 v. 904)2 (p. 142) alaḵkaḻi- ... in the Coimbatore

Superscript 2 is getting normalized to number 2 for lstmeval.

Dec 06 '21 15:12 Shreeshrii

Similarly for the trademark symbol:

san Guru_Italic 0000203 exp0_0

GT: TOPOGRAPHIC FASHIONABLE WETTER Core™2 problem ALLOWED) *Call YOU, Kanpur coach

CLI OCR: TOPOGRAPHIC FASHIONABLE WETTER Core™2 problem ALLOWED) *Call YOU, Kanpur coach

lstmeval OCR: TOPOGRAPHIC FASHIONABLE WETTER CoreTM?2 problem ALLOWED) *Call YOU, Kanpur coach
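The ² to 2 and ™ to TM substitutions are what a compatibility normalisation (NFKC) produces, while the canonical form (NFC) leaves both characters untouched. This thread does not confirm that NFKC is what lstmeval applies, so the sketch below is an assumption. It uses only Python's standard `unicodedata` module, and `forms` is a helper name made up for this example.

```python
import unicodedata

# Characters mentioned in this thread.
samples = {
    "superscript two": "²",
    "trade mark sign": "™",
    "long s": "ſ",
}


def forms(ch: str) -> tuple[str, str]:
    """Return the (NFC, NFKC) normalisations of a string."""
    return unicodedata.normalize("NFC", ch), unicodedata.normalize("NFKC", ch)


for name, ch in samples.items():
    nfc, nfkc = forms(ch)
    # NFC keeps each character; NFKC replaces it with its compatibility mapping.
    print(f"{name}: NFC={nfc!r} NFKC={nfkc!r}")
```

This does not account for the extra `?` in `CoreTM?2`, which must come from something else.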

@stweil I have attached a zip file with the custom IAST traineddata.

IAST_0.267000_136760_880600.zip

Dec 06 '21 15:12 Shreeshrii
