benchmarkstt
Special normalisation rules
How to deal with:
- numbers
- acronyms / symbols
- website / email spellings (e.g. use "dot", "at")
I would try, as far as possible, to use letter-based normalised representations for all of the above, such as:
- 100 -> one hundred (cento, cent)
- Hz -> hertz, WHO -> double u aitch o (less sure about this one ...)
- www.rai.it -> vu vu vu punto rai punto it, pippo@pluto.com -> pippo at pluto dot com
Of course, this would only be for the sake of comparison; no one would really want such transcripts as a final product. We don't even need to output the normalised text, except in a debug session.
I would actually do the opposite in all cases, i.e. go from complexity to simplicity.
- one hundred -> 100
- hertz -> Hz, double u aitch o -> WHO (not sure any STT actually outputs this)
- vu vu vu punto ... (same as 2, not sure any STT actually outputs this)
This matters especially in the case of "vu vu vu punto rai punto it": if it is transcribed wrongly, e.g. as "vuvuvupunto raai punto it", it would impact WER heavily, as it has 5 "words" wrong, while IMHO this should only count as one "word" and a one-point "penalty" on the WER score.
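To make the count above concrete, here is a minimal sketch of a word-level edit distance (the numerator of WER). The function name `word_errors` is made up for this illustration and is not part of any tool discussed here, and the compact mis-transcription "www.raai.it" is an assumed example.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


# Spelled-out form: 7 reference words, 5 of them end up counted as errors.
print(word_errors("vu vu vu punto rai punto it", "vuvuvupunto raai punto it"))  # 5

# Compact form (assumed mis-transcription): a single token, so a single error.
print(word_errors("www.rai.it", "www.raai.it"))  # 1
```

With the compact form the whole address is one token, so a mistake anywhere inside it costs exactly one error.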
I'm concerned that specific normalisation discussions can send us down a very deep rabbit hole. As @amessina71 says, the thing to remember is that we are not interested in absolute WER (compared to the reference) but in relative WER (compared to other vendors). So as long as we apply the normalisation consistently, it's not so important how we normalise. My preference would be to have a very small number of core normalisation rules that each user can add to. We also don't have to reinvent the wheel; there may be something we can use from these sources: https://www.kaggle.com/headsortails/watch-your-language-update-feature-engineering, https://github.com/google/sparrowhawk
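One possible shape for "a small core that each user can add to" is sketched below. Everything here (`CORE_RULES`, `apply_rules`, `replace_rule`) is a hypothetical name chosen for illustration, not an existing API; the only point is that the same rule list is applied to every transcript.

```python
from typing import Callable, Iterable, List

Rule = Callable[[str], str]

# Assumed core: lowercase everything and collapse runs of whitespace.
CORE_RULES: List[Rule] = [
    str.lower,
    lambda text: " ".join(text.split()),
]


def replace_rule(old: str, new: str) -> Rule:
    """Build a user rule that replaces one literal phrase with another."""
    return lambda text: text.replace(old, new)


def apply_rules(text: str, extra_rules: Iterable[Rule] = ()) -> str:
    """Apply the core rules first, then any user-supplied rules, in order."""
    for rule in [*CORE_RULES, *extra_rules]:
        text = rule(text)
    return text


# A user-added rule on top of the core set.
user_rules = [replace_rule("one hundred", "100")]
print(apply_rules("There were  One Hundred members", user_rules))
# there were 100 members
```

Because user rules run after the core rules, they can be written against lowercased, whitespace-normalised text.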
@MikeSmithEU I actually meant the opposite too. Engines would certainly output "www.rai.it" or "100". The problem is that their outputs might differ slightly from one another. One can output "www.rai.it", the other "www dot rai dot it". How do we compare them? Likewise, one could output "100", the other "one hundred". Having a common denominator for these kinds of normalisations would mean that engine 1 saying "There were 100 members attending" and engine 2 saying "There were one hundred members attending" would be considered equivalent.
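As a rough illustration of the "common denominator" idea, the sketch below maps both styles of output to one spelled-out form before comparison. The rule table `RULES` and the function `normalise` are assumptions made up for this example, and the direction (written form to spelled-out form) is arbitrary; the reverse direction would work equally well as long as it is applied to every engine.

```python
import re

# Hypothetical rule table: written form -> spelled-out form.
RULES = {
    "100": "one hundred",
    "hz": "hertz",
}


def normalise(text: str) -> str:
    """Lowercase, strip edge punctuation, and spell out numbers and addresses."""
    words = []
    for token in text.lower().split():
        token = token.strip(".,;:!?")
        # Tokens such as www.rai.it or pippo@pluto.com: spell out "." and "@".
        # Note this naive pattern would also rewrite decimals such as 3.5.
        if re.fullmatch(r"[\w-]+([.@][\w-]+)+", token):
            token = token.replace(".", " dot ").replace("@", " at ")
        words.extend(RULES.get(token, token).split())
    return " ".join(words)


engine_1 = "There were 100 members attending"
engine_2 = "There were one hundred members attending"
print(normalise(engine_1) == normalise(engine_2))  # True

print(normalise("www.rai.it"))       # www dot rai dot it
print(normalise("pippo@pluto.com"))  # pippo at pluto dot com
```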