
Need some info on the Spanish ASR CTC Conformer large model.

evilc3 opened this issue 3 years ago • 4 comments

Can I get more data on the dataset cleaning process?

I am a non-Spanish speaker 😅.

For example, the VoxPopuli dataset is listed as 120 hours after cleaning, but when I downloaded it, it has 150 hours. I need to know which 30 hours to clean out, please.

Also, the model description says the Conformer CTC large was fine-tuned from a pretrained model trained on 7,000 hours, but it points to a model trained on 16k hours.

Thanks for the help

evilc3 avatar Aug 08 '22 12:08 evilc3

@erastorgueva-nv could you respond with some info here

titu1994 avatar Aug 08 '22 20:08 titu1994

Hello, to answer your questions:

Data preparation:

In general:

  • We removed punctuation (replaced it with a space " "), as our ASR model cannot predict punctuation.
  • Dropped utterances with too low or too high a character rate (which implies the ground truth label is incorrect).
  • Removed from utterances any characters which are not in the Spanish alphabet (i.e. not in " abcdefghijklmnopqrstuvwxyzáéíñóúü").
  • Dropped utterances with too high a WER according to a previously-trained NeMo Spanish model (because a high WER implies the ground truth label is incorrect). A rough sketch of these general filters follows this list.
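
As an illustration only (not the exact recipe used for the released model), the general filters could look roughly like the Python below; the thresholds, punctuation set, and helper names are assumptions:

```python
import re
import string

# Illustrative character set and thresholds; the real pipeline's values may differ.
SPANISH_CHARS = set(" abcdefghijklmnopqrstuvwxyzáéíñóúü")

def clean_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and keep only Spanish-alphabet characters."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}¡¿«»]", " ", text)   # punctuation -> space
    text = "".join(ch for ch in text if ch in SPANISH_CHARS)             # drop non-Spanish characters
    return re.sub(r"\s+", " ", text).strip()

def keep_utterance(text: str, duration_s: float, wer_vs_existing_model: float,
                   min_char_rate: float = 4.0, max_char_rate: float = 25.0,
                   max_wer: float = 75.0) -> bool:
    """Drop utterances whose character rate or WER (vs. an existing model) suggests a bad label."""
    char_rate = len(text) / duration_s if duration_s > 0 else 0.0
    if not (min_char_rate <= char_rate <= max_char_rate):
        return False
    return wer_vs_existing_model <= max_wer
```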

For VoxPopuli there were some extra steps, e.g.:

  • Dropped utterances if, according to an existing ASR model (stt_es_citrinet_512), there is an insertion or deletion of more than 10 characters. This removes utterances which have extra spoken text that is not in the ground truth (or the inverse: are missing spoken text which is included in the ground truth). A rough sketch of this filter is below.
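
Purely as an illustration, this insertion/deletion check could be implemented as a character-level edit-distance alignment against the existing model's transcript; the helpers below are an assumed sketch, not the actual pipeline code:

```python
def char_edit_ops(reference: str, hypothesis: str) -> tuple[int, int]:
    """Count character insertions and deletions in one minimal edit alignment of hypothesis vs. reference."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = (total_cost, insertions, deletions) for reference[:i] vs hypothesis[:j]
    dp = [[(0, 0, 0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = (i, 0, i)          # all reference characters deleted
    for j in range(1, n + 1):
        dp[0][j] = (j, j, 0)          # all hypothesis characters inserted
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(
                (dp[i - 1][j - 1][0] + sub_cost, dp[i - 1][j - 1][1], dp[i - 1][j - 1][2]),  # match / substitution
                (dp[i][j - 1][0] + 1, dp[i][j - 1][1] + 1, dp[i][j - 1][2]),                 # insertion
                (dp[i - 1][j][0] + 1, dp[i - 1][j][1], dp[i - 1][j][2] + 1),                 # deletion
            )
    _, insertions, deletions = dp[m][n]
    return insertions, deletions

def keep_voxpopuli_utterance(ground_truth: str, model_transcript: str, max_ops: int = 10) -> bool:
    """Keep the utterance only if neither insertions nor deletions exceed max_ops characters."""
    insertions, deletions = char_edit_ops(ground_truth, model_transcript)
    return insertions <= max_ops and deletions <= max_ops
```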

Pretrained model:

The Spanish Conformer CTC model was fine-tuned from the v1.6.0 English Conformer CTC model, which was trained on approximately 7,000 hours of English data. There is now a newer (v1.10.0) English Conformer CTC model which was trained on around 24,500 hours of English data. You will see references to the different English models at the bottom of this page (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_large) and also here (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_large/version).
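
If it helps, a minimal sketch of loading the English checkpoint as a fine-tuning starting point, assuming the standard NeMo from_pretrained / restore_from API (the versioned .nemo filename below is only an example):

```python
import nemo.collections.asr as nemo_asr

# Fetch the latest released English Conformer CTC large checkpoint from NGC.
model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")

# To pin a specific version (e.g. the v1.6.0 checkpoint the Spanish model was fine-tuned from),
# download that .nemo file from the NGC page linked above and restore it directly:
# model = nemo_asr.models.ASRModel.restore_from("stt_en_conformer_ctc_large_1.6.0.nemo")
```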

erastorgueva-nv avatar Aug 10 '22 20:08 erastorgueva-nv

Hi, thanks for the reply.

Will follow the steps; thanks a lot for the VoxPopuli tip.

How did you deal with long audio files?

I had tried using ctc_segmentation with the latest CTC Conformer model; my goal was to create 20-second chunks out of long audio files, but this didn't help.

Did you make use of the ctc_segmentation tool in NeMo?

So for cleaning:

Should I just transcribe using the best model available and then remove files with WER > 100?

Also, is there any reason for removing punctuation?

evilc3 avatar Aug 12 '22 16:08 evilc3

Point 2: "Dropped utterances with too low or too high a character rate (which implies the ground truth label is incorrect)."

Why do I drop samples with too low a character rate? How is the ground truth label incorrect in this case? @erastorgueva-nv

evilc3 avatar Aug 16 '22 11:08 evilc3

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Oct 06 '22 02:10 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Oct 13 '22 02:10 github-actions[bot]