multilingual_kws

Filter out NaNs from Common Voice TSVs, distinguishing truly missing values from an intentional "nan" in the language vocabulary

Open mmaz opened this issue 2 years ago • 0 comments

In German, 'null' (zero) is converted to NaN by pandas when it is the only word present in the transcript (which occurs in the single-word-target-segments data). This happens because "null" is one of the strings pandas treats as missing by default.

One option is to pass na_filter=False when reading Common Voice TSVs, which turns off NA detection so that every field is kept as a string: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

However, we should also first check for truly missing values in the sentence transcription column, since with NA detection off an empty field is read as an empty string rather than NaN.
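A minimal sketch of that approach, not the project's actual loader: the inline TSV, the `sentence` column name, and the rule "empty or whitespace-only means missing" are all assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a Common Voice TSV; real files have more columns.
tsv = "path\tsentence\na.mp3\tnull\nb.mp3\t\nc.mp3\tnan\nd.mp3\thallo\n"

# Default parsing: "null", "nan", and the empty field are all read as NaN.
default_df = pd.read_csv(io.StringIO(tsv), sep="\t")

# na_filter=False disables NA detection, so every field stays a string.
raw_df = pd.read_csv(io.StringIO(tsv), sep="\t", na_filter=False, dtype=str)

# Assumption: only empty or whitespace-only transcripts are truly missing.
is_missing = raw_df["sentence"].str.strip() == ""
clean_df = raw_df[~is_missing].reset_index(drop=True)

print(default_df["sentence"].isna().sum())
print(clean_df["sentence"].tolist())
```

Checking for empty strings after the read keeps real words such as "null" and "nan" while still dropping rows that have no transcript.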

mmaz, Aug 24 '21 12:08