DeepSpeech (Very) obscure words need culling

Open theoldbloke opened this issue 4 years ago • 2 comments

I ran DeepSpeech on a very ordinary audio file (only 2 minutes) with a USA male English speaker.

It came up with some (very) obscure words. Here they are, with actual text following in brackets:

philippowanians (Philip, you know); cortwrite (sort of written); mariannakookaland (week, marking a week); lilybean (likely to)

Take "mariannakookaland". It appears in Punch, January 17, 1891 and was never heard of again. Google returns a trifling 6 results, and only one is relevant - the same Punch article on Project Gutenberg!

Apart from any Philippowanians from Mariannakookaland, this is definitely a case of more data not necessarily being an improvement.

From my untutored perspective it seems simple enough to filter the dictionaries against a word frequency list and discard the (very) obscure and bizarre.
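To make the filtering idea concrete, here is a minimal sketch, not anything taken from the DeepSpeech code base. It assumes you already have a word frequency list loaded as a dictionary; the function name `split_by_frequency`, the `min_count` threshold, and the toy counts are all invented for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def split_by_frequency(
    vocabulary: Iterable[str],
    word_counts: Dict[str, int],
    min_count: int = 5,
) -> Tuple[List[str], List[str]]:
    """Split a vocabulary into (kept, dropped) using a frequency list.

    A word is kept when the frequency list has seen it at least
    `min_count` times; words missing from the list count as zero.
    """
    kept: List[str] = []
    dropped: List[str] = []
    for word in vocabulary:
        if word_counts.get(word.lower(), 0) >= min_count:
            kept.append(word)
        else:
            dropped.append(word)
    return kept, dropped


# Toy frequency list: the counts are made up, not real corpus statistics.
word_counts = {"week": 120000, "likely": 95000, "written": 80000, "philip": 4000}

# Toy vocabulary mixing ordinary words with two of the oddities reported above.
vocabulary = ["week", "likely", "written", "philip", "mariannakookaland", "lilybean"]

kept, dropped = split_by_frequency(vocabulary, word_counts, min_count=5)
print("kept:", kept)
print("dropped:", dropped)
```

A real threshold would need tuning: set it too high and legitimate rare words (names, technical terms) disappear along with the one-off Punch coinages.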

More understandably, DeepSpeech also tripped up on some currently topical words:

facino, fouchet (Fauci); bidden (Biden); supersede (super-spreader)

Thank you for DeepSpeech,

Mike

Dec 29 '20 13:12 theoldbloke

Yes, I had some weird words too. Doesn't DeepSpeech use a language model when predicting words? If so, why do you think such words were recognized? In a generic English LM, these words must have had a very tiny probability, right?

Jan 04 '21 14:01 chmodsss

Instructions on how our language model is built are available here: https://deepspeech.readthedocs.io/en/latest/Scorer.html#reproducing-our-external-scorer

You should be able to build modified versions of the scorer with your own dataset or pruning parameters following those.
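As a rough illustration of what pruning by vocabulary size can look like on your own dataset, here is a sketch that keeps only the k most frequent words of a text corpus. It is a generic technique, not a copy of the scripts described at the link above; the helper name `top_k_vocabulary` and the three-line corpus are hypothetical, and the actual options are the ones documented there.

```python
from collections import Counter
from typing import Iterable, List


def top_k_vocabulary(lines: Iterable[str], k: int) -> List[str]:
    """Return the k most frequent whitespace-separated words in `lines`."""
    counts: Counter = Counter()
    for line in lines:
        # Lowercase so "Week" and "week" are counted as the same word.
        counts.update(line.lower().split())
    return [word for word, _ in counts.most_common(k)]


# Hypothetical three-line corpus; a real one would be far larger.
corpus = [
    "marking a week",
    "a week is likely",
    "a mariannakookaland week",
]

vocab = top_k_vocabulary(corpus, k=2)
print(vocab)
```

Words that occur only once in the corpus, such as the Punch coinage here, fall outside the top k and would not make it into a vocabulary limited this way.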

Jan 04 '21 14:01 reuben
