
metrics: avoid CER > 1.0

Open bertsky opened this issue 2 years ago • 2 comments

https://github.com/wincentbalin/pytesstrain/blob/b6a85dec3a02b878f8cee7d8170a75e7dabaeca6/pytesstrain/metrics/cer.py#L6

This definition is common but, IMHO, flawed. The numerator is a Levenshtein distance, i.e. a sum of costs along a path through the confusion matrix, so the natural denominator is the length of that path. (Of course, the editdistance package does not yield the actual alignment path, so you would have to use a different library, such as difflib.SequenceMatcher or rapidfuzz.levenshtein_editops.)

For some discussion, see here and here.

Perhaps the different definitions (gt-ref / max-ref / pathlen) could be made optional?
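To make the three options concrete, here is a minimal pure-Python sketch. It is not pytesstrain code: the function names `align` and `cer` and the `normalization` parameter are hypothetical, and it assumes that "gt-ref" means dividing by the ground-truth length, "max-ref" by the longer of the two strings, and "pathlen" by the number of steps (matches included) in a minimal-cost alignment. Since several minimal-cost alignments with different lengths can exist, the sketch arbitrarily picks the shortest one; that tie-break is an assumption too.

```python
def align(ref, hyp):
    """Return (distance, path_length) of a minimal-cost Levenshtein alignment.

    Each cell holds (cost, path length). Tuples compare lexicographically,
    so min() prefers lower cost first and, on ties, the shorter path
    (an arbitrary tie-break chosen for this sketch).
    """
    prev = [(j, j) for j in range(len(hyp) + 1)]  # only insertions so far
    for i, r in enumerate(ref, start=1):
        curr = [(i, i)]  # only deletions so far
        for j, h in enumerate(hyp, start=1):
            sub_cost = 0 if r == h else 1
            curr.append(min(
                (prev[j - 1][0] + sub_cost, prev[j - 1][1] + 1),  # match or substitution
                (prev[j][0] + 1, prev[j][1] + 1),                 # deletion
                (curr[j - 1][0] + 1, curr[j - 1][1] + 1),         # insertion
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp, normalization="pathlen"):
    """Character error rate under one of three (assumed) normalizations."""
    distance, path_length = align(ref, hyp)
    if normalization == "gt-ref":
        denominator = len(ref)
    elif normalization == "max-ref":
        denominator = max(len(ref), len(hyp))
    elif normalization == "pathlen":
        denominator = path_length
    else:
        raise ValueError(f"unknown normalization: {normalization!r}")
    if denominator == 0:
        # Empty reference: no error if nothing was recognized, otherwise unbounded.
        return 0.0 if distance == 0 else float("inf")
    return distance / denominator


if __name__ == "__main__":
    for mode in ("gt-ref", "max-ref", "pathlen"):
        print(mode, cer("ab", "abcde", mode))
```

With the ground truth "ab" and the recognized text "abcde", only the gt-ref variant exceeds 1.0. The other two stay within [0, 1], because the alignment path is never shorter than the longer string and never shorter than the distance itself.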

bertsky, Mar 29 '22 16:03

As I do not have much time to solve this, would you like to contribute a solution?

wincentbalin, Apr 27 '22 16:04

I would indeed – just give me some time.

bertsky, Apr 27 '22 17:04