Need some info on the Spanish ASR CTC Conformer large model.
Can I get more data on the dataset cleaning process?
I am a non-Spanish speaker 😅.
For example, the VoxPopuli dataset is listed as 120 hrs after cleaning, but the dataset I downloaded has 150 hrs. I need to know which 30 hrs to clean, please.
Also, the model description says the Conformer CTC large was trained on top of a pretrained model with 7,000 hrs, but it points to a model trained on 16k hrs.
Thanks for the help.
@erastorgueva-nv could you respond with some info here?
Hello, to answer your questions:
Data preparation:
In general:
- We removed punctuation (replaced with a space " ") as our ASR model cannot predict punctuation.
- Dropped words with too-low and too-high a character rate (which implies the ground truth label is incorrect).
- Removed characters which are not in the Spanish alphabet (i.e. not in " abcdefghijklmnopqrstuvwxyzáéíñóúü").
- Dropped utterances with too high a WER according to a previously-trained NeMo Spanish model (because a high WER implies the ground truth label is incorrect). A rough sketch of these filters is shown after this list.
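A rough sketch of what those general filters could look like on a NeMo-style JSONL manifest. The thresholds, the pred_text field (assumed to hold a transcript from a previously trained model), and the tiny WER implementation are illustrative assumptions, not the exact values or tooling used for the released model:

```python
import json
import re
import string

SPANISH_CHARS = " abcdefghijklmnopqrstuvwxyzáéíñóúü"
MIN_CHAR_RATE = 4.0    # chars per second of audio (assumed lower bound)
MAX_CHAR_RATE = 30.0   # chars per second of audio (assumed upper bound)
MAX_WER = 0.75         # max WER vs. an existing model's transcript (assumed)


def clean_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, keep only Spanish-alphabet characters."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}¿¡]", " ", text)
    text = "".join(c for c in text if c in SPANISH_CHARS)
    return re.sub(r"\s+", " ", text).strip()


def wer(ref: str, hyp: str) -> float:
    """Word error rate via a plain word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
            prev, d[j] = d[j], cur
    return d[len(h)] / max(len(r), 1)


# NeMo-style manifest: one JSON object per line with "audio_filepath",
# "duration" and "text"; "pred_text" is assumed to hold a transcript
# produced by a previously trained Spanish model.
kept = []
with open("manifest.json") as f:
    for line in f:
        entry = json.loads(line)
        text = clean_text(entry["text"])
        if not text:
            continue
        char_rate = len(text) / entry["duration"]
        if not (MIN_CHAR_RATE <= char_rate <= MAX_CHAR_RATE):
            continue  # too few/many characters per second: label is likely wrong
        if wer(text, clean_text(entry["pred_text"])) > MAX_WER:
            continue  # disagrees too much with an existing model's transcript
        entry["text"] = text
        kept.append(entry)

with open("manifest_cleaned.json", "w") as f:
    for entry in kept:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```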
For VoxPopuli there were some extra steps, e.g.:
- Dropped utterances if, according to an existing ASR model (stt_es_citrinet_512), there is an insertion or deletion of more than 10 characters. This is to remove utterances which have extra text spoken that is not in the ground truth (or the inverse: are missing spoken text which is included in the ground truth). A sketch of this check follows below.
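A minimal sketch of that check, using difflib to count character-level insertions and deletions between the ground truth and an existing model's transcript. This is only an illustration; the actual pipeline may align the two texts differently, and substitutions are ignored here for simplicity:

```python
import difflib

MAX_INS_OR_DEL_CHARS = 10  # threshold from the step described above


def has_large_mismatch(ground_truth: str, predicted: str) -> bool:
    """True if the texts differ by more than 10 inserted or 10 deleted characters."""
    inserted = deleted = 0
    matcher = difflib.SequenceMatcher(None, ground_truth, predicted)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":       # characters present only in the prediction
            inserted += j2 - j1
        elif tag == "delete":     # characters present only in the ground truth
            deleted += i2 - i1
    return inserted > MAX_INS_OR_DEL_CHARS or deleted > MAX_INS_OR_DEL_CHARS


# Example: the model hears extra speech that is not in the ground truth label.
print(has_large_mismatch("buenos días a todos",
                         "buenos días a todos los miembros del parlamento"))  # True
```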
Pretrained model:
The Spanish Conformer CTC model was fine-tuned from the v1.6.0 English Conformer CTC model, which was trained on approximately 7,000 hours of English data. There is now a newer (v1.10.0) English Conformer CTC model which was trained on around 24,500 hours of English data. You will see references to the different English models at the bottom of this page (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_large) and also here (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_large/version).
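For reference, loading one of those English checkpoints in NeMo as a starting point for fine-tuning might look roughly like the sketch below. The tokenizer directory is a placeholder, a character-based checkpoint would take new_vocabulary=[...] instead of a tokenizer directory, and this is not the exact recipe used for the released Spanish model:

```python
import nemo.collections.asr as nemo_asr

# Pull the latest published English Conformer CTC large checkpoint from NGC...
model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")

# ...or load a specific version downloaded from the NGC pages linked above.
# model = nemo_asr.models.ASRModel.restore_from("path/to/stt_en_conformer_ctc_large.nemo")

# Swap the English tokenizer for a Spanish one before fine-tuning
# ("spanish_tokenizer_dir" is a placeholder for a tokenizer you have built).
model.change_vocabulary(new_tokenizer_dir="spanish_tokenizer_dir", new_tokenizer_type="bpe")
```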
Hi, thanks for the reply.
I will follow the steps. Super thanks for the VoxPopuli tip.
How did you deal with long audios?
I had tried using ctc_segmentation with the latest CTC Conformer model; my goal was to create 20-second chunks out of large audios, but this didn't help.
Did you make use of the ctc_segmentation tool in NeMo?
So for cleaning:
Should I just transcribe using the best model available, and then remove files with WER > 100?
Also, is there any reason for removing punctuation?
Point 2: "Dropped words with too-low and too-high a character rate (which implies the ground truth label is incorrect)."
Why do I drop samples with too low a character rate? How is the ground truth label incorrect in this case? @erastorgueva-nv
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.