multilingual_kws

expand and validate text normalization/cleaning filters

Open mmaz opened this issue 2 years ago • 1 comments

given two transcripts, (1) [hello is a common greeting] and (2) [she said, “hello”], without punctuation filtering we would treat [hello] and [“hello”] as separate words
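A minimal sketch of the kind of punctuation filtering described above, assuming whitespace tokenization and that only leading and trailing punctuation should be stripped (so word-internal apostrophes and hyphens survive). The function names are hypothetical and are not taken from the repository:

```python
import unicodedata


def strip_edge_punctuation(token: str) -> str:
    """Drop leading and trailing Unicode punctuation (categories P*) from a token."""
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[start:end]


def normalize_transcript(transcript: str) -> list[str]:
    """Split on whitespace, strip edge punctuation, and drop tokens that become empty."""
    words = (strip_edge_punctuation(t) for t in transcript.split())
    return [w for w in words if w]


print(normalize_transcript("hello is a common greeting"))
print(normalize_transcript("she said, “hello”"))
```

With this filter both transcripts yield the same token [hello]. Using the Unicode category rather than ASCII `string.punctuation` is what catches the curly quotes in the second transcript.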

mmaz, Sep 28 '21 00:09

see which languages this impacts - Sharad says maybe 1K words total
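One rough way to size the impact for a given language, assuming the transcripts are available as plain strings: group raw tokens by their punctuation-stripped form and report the forms that appear under more than one spelling. `find_aliases` is a hypothetical helper, not something in the repository:

```python
import unicodedata
from collections import defaultdict


def _strip_edges(token: str) -> str:
    """Drop leading and trailing Unicode punctuation from a token."""
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[start:end]


def find_aliases(transcripts):
    """Map each cleaned word to its raw surface forms, keeping only words with 2+ forms."""
    groups = defaultdict(set)
    for transcript in transcripts:
        for token in transcript.split():
            key = _strip_edges(token)
            if key:
                groups[key].add(token)
    return {word: forms for word, forms in groups.items() if len(forms) > 1}


aliases = find_aliases(["hello is a common greeting", "she said, “hello”"])
print(len(aliases), aliases)
```

Running this once per language over its transcripts gives a count of aliased words. It does not catch words that only ever appear with punctuation attached, so it is a lower bound on what the filter would change.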

v1: fix aliasing in English (nice to have); don't touch other languages

going forward, use validated.csv instead of validated.tsv for transcript .labs - these might filter out quotation marks/commas already for all languages

mmaz, Oct 06 '21 15:10