pytesstrain
metrics: avoid CER > 1.0
https://github.com/wincentbalin/pytesstrain/blob/b6a85dec3a02b878f8cee7d8170a75e7dabaeca6/pytesstrain/metrics/cer.py#L6
This definition is common but, IMHO, flawed. The numerator is a Levenshtein distance, i.e. a sum of costs along a path through the confusion matrix, so the natural denominator is the length of that path, which also keeps the rate from exceeding 1.0. (Of course, the editdistance package does not yield the actual alignment path, so you would have to use a different library, like difflib.SequenceMatcher or rapidfuzz.levenshtein_editops.)
For some discussion, see here and here.
Perhaps the different definitions (gt-ref / max-ref / pathlen) could be made optional?
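To make the proposal concrete, here is a minimal pure-Python sketch of the three normalizations. It is not pytesstrain's code: the function names and the `normalization` parameter are hypothetical, and I am reading "gt-ref" as dividing by the ground-truth length and "max-ref" as dividing by the longer of the two strings. Several optimal alignments can have different lengths, so the sketch assumes the shortest one is used.

```python
def levenshtein_with_path_length(ref, hyp):
    """Return (distance, path_length) of an optimal alignment.

    Assumption: among alignments with minimal cost, the shortest
    path is chosen (tuples are compared lexicographically).
    """
    # prev[j] = (cost, path length) for aligning the empty prefix with hyp[:j]
    prev = [(j, j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [(i, i)]  # aligning ref[:i] with the empty prefix: i deletions
        for j, h in enumerate(hyp, start=1):
            sub_cost = 0 if r == h else 1
            candidates = (
                (prev[j - 1][0] + sub_cost, prev[j - 1][1] + 1),  # match or substitution
                (prev[j][0] + 1, prev[j][1] + 1),                 # deletion
                (curr[j - 1][0] + 1, curr[j - 1][1] + 1),         # insertion
            )
            curr.append(min(candidates))
        prev = curr
    return prev[-1]


def cer(ref, hyp, normalization="pathlen"):
    """Character error rate under one of three (hypothetical) option names."""
    dist, path_len = levenshtein_with_path_length(ref, hyp)
    if normalization == "gt-ref":
        denom = len(ref)
    elif normalization == "max-ref":
        denom = max(len(ref), len(hyp))
    elif normalization == "pathlen":
        denom = path_len
    else:
        raise ValueError(f"unknown normalization: {normalization!r}")
    if denom == 0:
        # Both empty: no errors. Empty reference only: rate is unbounded.
        return 0.0 if dist == 0 else float("inf")
    return dist / denom
```

With `ref="ab"` and `hyp="abcde"` the distance is 3, so "gt-ref" gives 1.5, while "max-ref" and "pathlen" both give 0.6. Only the last two are bounded by 1.0, since every step on the path costs at most 1.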
As I do not have much time to solve this, would you like to contribute a solution?
I would indeed – just give me some time.