There are some mistakes in align_lexicon.txt for the French model 0.22

Open warichet opened this issue 2 years ago • 3 comments

Hello, we are running some tests of the Vosk ASR platform in French. We chose the model vosk-model-fr-0.22 because of its Apache 2 license. The g2p used to generate align_lexicon.txt has some bugs. One of them concerns the treatment of acronyms. In French, the normal way to pronounce c.... is "K", except in acronyms, where it is "S E", because in an acronym we spell out each letter.
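To make the acronym case concrete, here is a minimal sketch of spelling an acronym letter by letter instead of running it through g2p. Only the "C" to "S E" mapping comes from the post above. The other letter names, their phone symbols, the function names and the one-line `word phones` output layout are assumptions for illustration and may not match the phone set or file layout of the real model.

```python
# Hypothetical letter-name table. Only "C" -> "S E" is taken from the thread;
# the other entries and their phone symbols are assumptions for illustration.
LETTER_NAMES = {
    "B": ["B", "E"],
    "C": ["S", "E"],
    "D": ["D", "E"],
    "P": ["P", "E"],
    "T": ["T", "E"],
}


def spell_acronym(word, letter_names=LETTER_NAMES):
    """Return the phones obtained by spelling each letter of an acronym."""
    phones = []
    for letter in word.upper():
        if letter not in letter_names:
            # Fail loudly rather than guess a pronunciation.
            raise KeyError(f"no letter name for {letter!r}")
        phones.extend(letter_names[letter])
    return phones


def lexicon_line(word, phones):
    """Format one entry as 'word phone phone ...' (assumed layout)."""
    return f"{word} {' '.join(phones)}"


print(lexicon_line("CD", spell_acronym("CD")))
```

A real fix would also need a way to decide which words are acronyms, which this sketch leaves out.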

To continue our tests, we need to fine-tune the LM with the corrected lexicon. Is it possible to have the ARPA file to test our corrections?

In return, of course, we will provide you with the new lexicon.

Best regards, Seb

warichet avatar Sep 15 '22 12:09 warichet

Is it possible to have the ARPA file to test our corrections?

Sure, you have to write an email to [email protected] and describe your project to get a link to the French model update package.

nshmyrev avatar Sep 15 '22 13:09 nshmyrev

Hello,

We will start checking and correcting the model. To make this easier, we chose to use the French dataset of Lingua Libre, which is a very complete annotated dataset. We will generate a 1-gram graph (in effect, a model without language-model context) so that only the acoustic model and the lexicon are checked. We will then pass the audio files to Vosk and look at the words whose transcription differs from the annotation. With this result we can find words with bad phonemes, and possibly phonemes the AM has difficulty with. We will do this in a notebook. If you are interested, we can share the notebook.
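The comparison step could look roughly like the sketch below. It assumes the recognition results have already been collected as `(annotation, transcription)` pairs, one per audio file. The function names and the normalization (lowercasing, collapsing whitespace) are assumptions, not part of Vosk or Lingua Libre.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def find_mismatches(pairs):
    """Group recognized outputs by annotated word where the two differ.

    `pairs` is an iterable of (annotation, transcription) strings.
    Returns a dict mapping each mismatched annotation to the list of
    transcriptions the recognizer produced for it.
    """
    mismatches = defaultdict(list)
    for expected, recognized in pairs:
        expected_norm = normalize(expected)
        recognized_norm = normalize(recognized)
        if expected_norm != recognized_norm:
            mismatches[expected_norm].append(recognized_norm)
    return dict(mismatches)


# Made-up example data, only to show the shape of the input and output.
example = [("chat", "chat"), ("CD", "kd"), ("Bonjour", " bonjour ")]
print(find_mismatches(example))
```

Words that show up here with several different wrong transcriptions are the ones to check first in the lexicon.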

Best regards, Sébastien

warichet avatar Nov 09 '22 17:11 warichet

Sure, that would be nice.

nshmyrev avatar Nov 09 '22 22:11 nshmyrev