multilingual_kws
expand and validate text normalization/cleaning filters
given two transcripts, 1. [hello is a common greeting] and 2. [she said, “hello”]: without punctuation filtering we would treat [hello] and [“hello”] as separate words
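a minimal sketch of the aliasing and one possible fix, using only the Python standard library. The helper names (`strip_punctuation`, `count_keywords`) are hypothetical and not taken from the multilingual_kws codebase, and stripping only leading/trailing Unicode punctuation is an assumption about what the filter should do:

```python
import unicodedata
from collections import Counter


def strip_punctuation(token: str) -> str:
    """Strip leading/trailing Unicode punctuation (categories P*) from a token.

    Hypothetical helper: interior characters such as the apostrophe in
    "don't" are left untouched.
    """
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[start:end]


def count_keywords(transcripts, normalize=True):
    """Count whitespace-separated words, optionally after punctuation stripping."""
    counts = Counter()
    for transcript in transcripts:
        for token in transcript.split():
            word = strip_punctuation(token) if normalize else token
            if word:  # drop tokens that were punctuation only
                counts[word] += 1
    return counts


transcripts = ["hello is a common greeting", "she said, “hello”"]
raw = count_keywords(transcripts, normalize=False)
clean = count_keywords(transcripts, normalize=True)

# Without filtering, the quoted form is counted as its own word.
print(raw["hello"], raw["“hello”"])
# With filtering, both occurrences collapse onto the same word.
print(clean["hello"])
```

running both counts over a language's transcripts and comparing the vocabularies would be one way to estimate how many words are affected.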
see which languages this impacts - Sharad says maybe 1K words total
v1: fix aliasing in English (nice to have); don't touch other languages
going forward, use validated.csv instead of validated.tsv for transcript .labs - these might filter out quotation marks/commas already for all languages