tesseract
lstmeval: Improve output by ensuring 'Truth:' text is encoded the same way as OCR output…
This ensures that transformations such as Unicode normalisation are applied to the truth text as well as to the OCR output, so that the two can be compared properly.
Before this a perfect OCR result could show different lines for Truth and OCR if the OCR output included characters that were normalised.
@nickjwhite Please provide a sample demonstrating this.
> Before this a perfect OCR result could show different lines for Truth and OCR if the OCR output included characters that were normalised.
I had noticed this in the past but do not have any ready example to test and verify.
Is this issue related?
> Is this issue related?
No, that issue looks more like the wrong normalisation form being used: NFKC normalises long s (ſ) to s, whereas NFC leaves it unchanged:
$ perl -e 'use utf8; use Unicode::Normalize; print NFC("ſ"),"\n";'
ſ
$ perl -e 'use utf8; use Unicode::Normalize; print NFKC("ſ"),"\n";'
s
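For anyone without Perl at hand, the same check can be sketched in Python with the standard `unicodedata` module. This only illustrates the difference between the two normalisation forms; it makes no claim about which form Tesseract applies internally.

```python
import unicodedata

LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S (ſ)

# NFC applies canonical composition only, so long s is left alone.
nfc = unicodedata.normalize("NFC", LONG_S)

# NFKC also applies compatibility mappings, which fold long s to a plain s.
nfkc = unicodedata.normalize("NFKC", LONG_S)

print("NFC keeps long s:", nfc == LONG_S)
print("NFKC result:", nfkc)
```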
Ok, I have a sample now.
Ground Truth: aṇṇi- aṇṇi- , 11 v. 904)² (p. 142) alakkaḻi- ... in the Coimbatore
OCR via CLI using custom IAST traineddata: aṇṇi- aṇṇi- , 11 v. 904)² (p. 142) alakkaḻi- ... in the Coimbatore
OCR via lstmeval using same custom IAST traineddata: aṇṇi- aṇṇi- , 11 v. 904)2 (p. 142) alaḵkaḻi- ... in the Coimbatore
Superscript 2 (²) is normalized to the digit 2 by lstmeval.
The same happens with the trademark symbol (™):
GT: TOPOGRAPHIC FASHIONABLE WETTER Core™2 problem ALLOWED) *Call YOU, Kanpur coach
CLI OCR: TOPOGRAPHIC FASHIONABLE WETTER Core™2 problem ALLOWED) *Call YOU, Kanpur coach
lstmeval OCR: TOPOGRAPHIC FASHIONABLE WETTER CoreTM?2 problem ALLOWED) *Call YOU, Kanpur coach
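Both substitutions are what a compatibility normalisation would produce. The sketch below is a rough way to list the characters in a ground-truth line that such a normalisation alters. It assumes NFKC as a stand-in for whatever lstmeval actually applies; the helper name `compat_changes` and the short `truth` string (pieced together from fragments of the samples above) are hypothetical.

```python
import unicodedata


def compat_changes(text, form="NFKC"):
    """Return (original, normalised) pairs for each character that `form` alters.

    Characters are checked one at a time, so this reports per-character
    mappings only, not changes that depend on neighbouring characters.
    """
    changes = []
    for ch in text:
        norm = unicodedata.normalize(form, ch)
        if norm != ch:
            changes.append((ch, norm))
    return changes


# Hypothetical ground-truth fragment, built from pieces of the samples above.
truth = "904)\u00b2 (p. 142) Core\u2122"

changes = compat_changes(truth)
for original, normalised in changes:
    print(f"U+{ord(original):04X} {original!r} -> {normalised!r}")

# Normalising the truth text the same way makes it comparable with
# output that has already been normalised.
normalised_truth = unicodedata.normalize("NFKC", truth)
print(normalised_truth)
```

Running this kind of check over the ground-truth files before evaluation shows in advance which lines would differ from the OCR output purely because of normalisation.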
@stweil I have attached a zip file with the custom IAST traineddata.