DeepSpeech
(Very) obscure words need culling
I ran DeepSpeech on a very ordinary audio file (only 2 minutes) with a USA male English speaker.
It came up with some (very) obscure words. Here they are, with actual text following in brackets:
- philippowanians (Philip, you know)
- cortwrite (sort of written)
- mariannakookaland (week, marking a week)
- lilybean (likely to)
Take "mariannakookaland". It appears in Punch, January 17, 1891 and was never heard of again. Google returns a trifling 6 results, and only one is relevant - the same Punch article on Project Gutenberg!
Apart from any Philippowanians from Mariannakookaland, this is definitely a case of more data not necessarily being an improvement.
From my untutored perspective it seems simple enough to filter the dictionaries against a word frequency list and discard the (very) obscure and bizarre.
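That filtering idea can be sketched in a few lines. This is only an illustration of the general approach, not anything DeepSpeech ships; the file format and the min_count threshold are hypothetical choices:

```python
# Sketch: cull (very) obscure words from a vocabulary by checking each
# word against a word-frequency list before building the scorer.
# Assumes a hypothetical frequency file with "word count" per line.

def load_frequencies(path):
    """Load a hypothetical 'word count' file into a dict."""
    freqs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.split()
            freqs[word] = int(count)
    return freqs

def filter_vocabulary(words, freqs, min_count=50):
    """Keep only words seen at least min_count times; unknown words are dropped."""
    return [w for w in words if freqs.get(w, 0) >= min_count]

# Example: a one-off curiosity like "mariannakookaland" falls below any
# sensible threshold and is discarded.
freqs = {"week": 120000, "mariannakookaland": 1, "likely": 90000}
kept = filter_vocabulary(["week", "mariannakookaland", "likely"], freqs)
```

The threshold is the knob: set it too high and you lose legitimate rare words, too low and the obscurities survive.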
More understandably, DeepSpeech also tripped up on some currently topical words:
- facino, fouchet (Fauci)
- bidden (Biden)
- supersede (super-spreader)
Thank you for DeepSpeech,
Mike
Yes, I too had some weird words. Doesn't DeepSpeech use a language model when predicting the words? If so, why do you think such words were recognized? In a generic English LM, these words should have a very tiny probability, right?
Instructions on how our language model is built are available here: https://deepspeech.readthedocs.io/en/latest/Scorer.html#reproducing-our-external-scorer
Following those instructions, you should be able to build a modified scorer with your own dataset or your own pruning parameters.
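As a rough sketch of what such a rebuild looks like: the linked docs describe a `generate_lm.py` script, and limiting the vocabulary size plus pruning rare n-grams is one way to cull obscure words. The exact flags and values below are illustrative and may differ between releases, so check the script's `--help` and the docs before relying on them:

```shell
# Hypothetical invocation sketch; verify flags against your DeepSpeech
# checkout's generate_lm.py --help before running.
# --top_k keeps only the N most frequent words from the corpus,
# which discards one-off obscurities; --arpa_prune drops rare n-grams.
python3 data/lm/generate_lm.py \
  --input_txt my_corpus.txt.gz \
  --output_dir . \
  --top_k 250000 \
  --kenlm_bins /path/to/kenlm/build/bin/ \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|1|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie
```

Lowering `--top_k` is the most direct equivalent of filtering against a word-frequency list, since the script ranks words by corpus frequency.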