TensorFlowASR WER for conformer update

Hi, I've just finished training a conformer using the sentencepiece featurizer on LibriSpeech for 50 epochs. Here are the results if you want to update your readme:

dataset_config:
    train_paths:
      - /data/datasets/LibriSpeech/train-clean-100/transcripts.tsv
      - /data/datasets/LibriSpeech/train-clean-360/transcripts.tsv
      - /data/datasets/LibriSpeech/train-other-500/transcripts.tsv
    eval_paths:
      - /data/datasets/LibriSpeech/dev-clean/transcripts.tsv
      - /data/datasets/LibriSpeech/dev-other/transcripts.tsv
    test_paths:
      - /data/datasets/LibriSpeech/test-clean/transcripts.tsv

Test results:

G_WER = 5.22291565
G_CER = 1.9693377
B_WER = 5.19438553
B_CER = 1.95449066
BLM_WER = 100
BLM_CER = 100

The strange part is that I got the same metrics on the test-other dataset, hmmm...

Jan 22 '21 13:01 gandroz

@gandroz Wow cool, if you got the same result for test-other then you should check the transcript file to see if it points to test-other files. And you should check the test-clean transcripts file too. Anyway, I'm thinking that maybe the authors have some tricks that reduce the result to 2.7% that we didn't see.

Jan 23 '21 15:01 nglehuy

And one more thing is that there's a very small difference between greedy and beam search at this kind of WER percent, so we can ignore the difference and test only on greedy to see if it reduces to near 2.7-3%, for getting faster results

Jan 23 '21 16:01 nglehuy

I'll try to continue training for several more epochs, as training does not seem to have converged yet. I'll read the paper again to look for any clue on how to reduce WER even more. But I don't have anything special in my transcripts; test-clean and test-other are well separated.

Jan 23 '21 17:01 gandroz

@gandroz You should check or regenerate the transcript file; maybe when creating the test-other transcript file, you pointed to the test-clean directory. If everything is right, then it's so weird haha :laughing:

Jan 23 '21 18:01 nglehuy

I checked both files, my config file too and got the same results. So weird. I'll try to debug to find any mistake

Jan 23 '21 19:01 gandroz

I found why I always got the same test metrics.... I tested on the test-clean dataset and it saved a test.tsv file, but each time I performed another test, as there was already an existing file, only the metrics were computed and no inference was done. I've cleaned this file and have launched another test with the test-other dataset to continue the update.
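One way to guard against this kind of stale result is to delete the old predictions file before each test run. The following is only a sketch: the `test.tsv` name comes from the comment above, while `fresh_output_path` and the directory layout are hypothetical and not part of TensorFlowASR.

```python
import os
import tempfile


def fresh_output_path(outdir: str, name: str = "test.tsv") -> str:
    """Return the results path, deleting any file left by a previous run.

    Hypothetical helper: it only illustrates removing cached predictions so
    that inference runs again instead of re-scoring the old file.
    """
    path = os.path.join(outdir, name)
    if os.path.exists(path):
        os.remove(path)  # stale predictions from an earlier test set
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as outdir:
        stale = os.path.join(outdir, "test.tsv")
        with open(stale, "w", encoding="utf-8") as f:
            f.write("predictions from test-clean\n")
        path = fresh_output_path(outdir)
        print(os.path.exists(path))  # the stale file is gone
```

An alternative with the same effect would be to write one results file per test set (for example by passing a different `name`), so that test-clean and test-other predictions never share a file.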

Jan 26 '21 13:01 gandroz

@gandroz Can you post your full config file you are using to generate the ~5% WER results?

Thanks!!!

Jan 29 '21 17:01 ncilfone

@ncilfone sure !

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  output_path_prefix: /data/models/asr/conformer_sentencepiece_subword
  model_type: unigram
  target_vocab_size: 1024
  blank_at_zero: True
  beam_width: 5
  norm_score: True
  corpus_files:
    - /data/datasets/LibriSpeech/train-clean-100/transcripts.tsv
    - /data/datasets/LibriSpeech/train-clean-360/transcripts.tsv
    - /data/datasets/LibriSpeech/train-other-500/transcripts.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid_concat
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0.1
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 1
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  joint_activation: tanh

learning_config:
  augmentations:
    after:
      time_masking:
        num_masks: 10
        mask_factor: 100
        p_upperbound: 0.05
      freq_masking:
        num_masks: 1
        mask_factor: 27

  dataset_config:
    train_paths:
      - /data/datasets/LibriSpeech/train-clean-100/transcripts.tsv
      - /data/datasets/LibriSpeech/train-clean-360/transcripts.tsv
      - /data/datasets/LibriSpeech/train-other-500/transcripts.tsv
    eval_paths:
      - /data/datasets/LibriSpeech/dev-clean/transcripts.tsv
      - /data/datasets/LibriSpeech/dev-other/transcripts.tsv
    test_paths:
      - /data/datasets/LibriSpeech/test-clean/transcripts.tsv
      - /data/datasets/LibriSpeech/test-other/transcripts.tsv
    tfrecords_dir: null

  optimizer_config:
    warmup_steps: 10000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 2
    accumulation_steps: 4
    num_epochs: 50
    outdir: /data/models/asr/conformer_sentencepiece_subword
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
    checkpoint:
      filepath: /data/models/asr/conformer_sentencepiece_subword/checkpoints/{epoch:02d}.h5
      save_best_only: True
      save_weights_only: False
      save_freq: epoch
    states_dir: /data/models/asr/conformer_sentencepiece_subword/states
    tensorboard:
      log_dir: /data/models/asr/conformer_sentencepiece_subword/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: 'epoch'
      profile_batch: 2

I used a sentencepiece (unigram) model as vocab, currently trying with the BPE version

Jan 29 '21 18:01 gandroz

Thanks @gandroz!

Is that the vocab here: vocabularies/librispeech_train_4_1030.subwords

Edit: Based on the config it seems like you might generate one before training?

Also is this just single GPU training?

Jan 29 '21 18:01 ncilfone

No, it's not that vocab. However, you can train your own with script\generate_vocab_sentencepiece.py, giving it your config file. And I'm training on two GTX 1080Ti. It took soooo long to train; I'm looking for a way to pre-compute the fbanks, as they are currently computed on the fly, which might take some time.
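A minimal sketch of what pre-computing could look like: compute the features once per audio file, store them on disk, and reload them on later epochs. Everything here is an assumption for illustration; `cached_features`, the pickle format and the cache layout are hypothetical, and `compute_fn` stands in for whatever fbank extraction is actually used.

```python
import hashlib
import os
import pickle
import tempfile
from typing import Callable, List

Features = List[List[float]]  # frames x feature bins


def cached_features(
    audio_path: str,
    compute_fn: Callable[[str], Features],
    cache_dir: str,
) -> Features:
    """Compute features once per audio file and reuse them afterwards."""
    # Hash the path so that any file name maps to a safe cache file name.
    key = hashlib.sha1(audio_path.encode("utf-8")).hexdigest()
    cache_path = os.path.join(cache_dir, key + ".pkl")
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    features = compute_fn(audio_path)  # expensive step, e.g. fbank extraction
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(features, f)
    return features


if __name__ == "__main__":
    calls = []

    def fake_fbank(path: str) -> Features:
        # Stand-in for a real feature extractor; records how often it runs.
        calls.append(path)
        return [[0.0] * 80 for _ in range(3)]

    with tempfile.TemporaryDirectory() as cache_dir:
        cached_features("utt1.flac", fake_fbank, cache_dir)
        cached_features("utt1.flac", fake_fbank, cache_dir)
    print(len(calls))  # the extractor ran only on the first call
```

Note that caching features this way only helps if augmentation is applied after loading, since the stored features are fixed.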

Jan 29 '21 18:01 gandroz

Yeah just realized that you generate it based on the config options. Thanks for letting me know!

I'm assuming you are doing the featurization of the WAV files in TF as the stft etc. should be a bit faster on the GPU. DALI might be another place to look too although I've never used it...

Jan 29 '21 18:01 ncilfone

Final question I promise... It looks like you are using and tokens in SentencePiece but I'm guessing the text featurizer for the LibriSpeech transcripts doesn't have those? Or do you pad them onto each one?

Jan 29 '21 18:01 ncilfone

I think the best way to accelerate processing is to pre-process the fbanks, just as it is done in fairseq. For your information, featurization is done by the class tensorflow_asr\featurizers\speech_featurizers.py::TFSpeechFeaturizer.

I'm guessing the text featurizer for the LibriSpeech transcripts doesn't have those? Or do you pad them onto each one?

I'm not sure I understand your question. Sentencepiece is an unsupervised text tokenizer and detokenizer, so you have to train a model on the LibriSpeech transcripts. During training, the tokenized transcripts in each batch are padded to the length of the longest one in that batch.
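The per-batch padding described above can be sketched as follows. This is not the TensorFlowASR implementation; `pad_batch` is a hypothetical helper, and using 0 as the padding id is an assumption (suggested only by `blank_at_zero: True` in the config).

```python
from typing import List, Tuple


def pad_batch(
    sequences: List[List[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Pad token-id sequences to the longest one in the batch.

    Returns the padded batch and the original lengths, which the loss
    needs in order to ignore the padded positions.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


if __name__ == "__main__":
    batch = [[5, 9, 2], [7]]
    padded, lengths = pad_batch(batch)
    print(padded)   # [[5, 9, 2], [7, 0, 0]]
    print(lengths)  # [3, 1]
```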

Jan 29 '21 18:01 gandroz

Ugh forgot that markdown will remove the notation I used... This is what I meant...

It looks like you are using <sos> and <eos> tokens in SentencePiece but I'm guessing the text featurizer for the LibriSpeech transcripts doesn't have those? Or do you pad them onto each one?

Jan 29 '21 18:01 ncilfone

Oh I see. You are right, the transcripts do not have those tokens, and they are useless as far as I understand it. However, you can add them when encoding some text. You can find more details on the repo, and I've just realized that there is a tensorflow binding.... I think I'll try it instead of the python implementation I used.

Jan 29 '21 18:01 gandroz

Hi @gandroz , Have you tested on test-other set, and what is the result? Thanks!

Jan 30 '21 15:01 tund

@tund not yet, it took me a week to test on test-clean and I have not had time yet

Jan 30 '21 16:01 gandroz

Thanks for your reply @gandroz . Since the performance using beam search is quite close to greedy search, I think running only greedy search will be much faster. Another question: do you use gradient accumulation for training? I saw "accumulation_steps: 4" in the config file, but I'm not sure what exactly your training command is.

Jan 30 '21 23:01 tund

Indeed, I could just perform greedy search for this test. In the near future perhaps... And yes, I used gradient accumulation.
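For readers unfamiliar with gradient accumulation, here is a toy sketch of the idea with `batch_size: 2` and `accumulation_steps: 4` from the config. It uses a one-parameter linear model in plain Python rather than the actual trainer, and it assumes the accumulated gradients are averaged (the real trainer may sum them instead). Under that assumption, four micro-batches of 2 give the same gradient as one batch of 8.

```python
from typing import Sequence, Tuple

Example = Tuple[float, float]  # (input x, target y)


def grad_mse(w: float, batch: Sequence[Example]) -> float:
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(
    w: float, data: Sequence[Example], batch_size: int, accumulation_steps: int
) -> float:
    """Average the gradients of several micro-batches before one update."""
    total = 0.0
    for step in range(accumulation_steps):
        micro = data[step * batch_size:(step + 1) * batch_size]
        total += grad_mse(w, micro)
    return total / accumulation_steps


if __name__ == "__main__":
    data = [(float(i), 3.0 * i) for i in range(1, 9)]  # 8 toy examples
    g_acc = accumulated_grad(0.5, data, batch_size=2, accumulation_steps=4)
    g_full = grad_mse(0.5, data)
    print(abs(g_acc - g_full) < 1e-9)  # True
```

The memory cost is that of a batch of 2, while each optimizer update sees the equivalent of 8 examples.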

Jan 31 '21 00:01 gandroz

@gandroz any chance you can post your loss curves?

Feb 01 '21 22:02 ncilfone

sure

The glitches at the end are due to an infinite-loop bug that was corrected afterwards (evaluation ran endlessly after training ended). I trained the model for 40 epochs first and then continued for 10 more epochs.

Feb 02 '21 00:02 gandroz

How are you able to achieve such good results with your models? I've trained a conformer subword model, but it stops improving after ~20 epochs.

I've updated the Keras trainer to use EarlyStopping, which stops the training process after 5 epochs without improvement in validation loss.

What am I missing?

Train data: 50hrs
Eval data: 7hrs
Using TF RNN Loss

Audio lengths. Not sure :

mean       2.646981
std        2.420535
min        0.100000
25%        0.900000
50%        1.570000
75%        4.030000
max       20.000000

The test results are complete rubbish:

G_WER = 114.837982
G_CER = 88.0064
B_WER = 100
B_CER = 100
BLM_WER = 100
BLM_CER = 100
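A WER above 100 is arithmetically possible: WER counts substitutions, deletions and insertions against the number of reference words, so a hypothesis that contains many extra words can produce more errors than there are reference words. The sketch below shows the standard per-utterance definition; it is not the scoring code of the toolkit, which typically aggregates over the whole test set.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two token lists (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent for a single utterance."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hypothesis.split()) / len(ref)


if __name__ == "__main__":
    print(wer("hello world", "hello world"))  # 0.0
    print(wer("hello world", "a b c d e"))    # 250.0
```

In the second call, the two reference words are substituted and three more words are inserted, giving 5 errors for 2 reference words.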

config

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  vocabulary: vocabularies/lithuanian.subwords
  target_vocab_size: 4096
  max_subword_length: 4
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - /tf_asr/manifests/liepa.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid_concat
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: False
  prediction_projection_units: 0
  joint_dim: 320
  joint_activation: tanh

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      after:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /tf_asr/manifests/liepa_train.tsv
    tfrecords_dir: /tf_asr/tfrecords/tfrecords-train
    shuffle: True
    cache: False
    buffer_size: 100
    drop_remainder: True

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /tf_asr/manifests/liepa_eval.tsv
    tfrecords_dir: /tf_asr/tfrecords/tfrecords-eval
    shuffle: False
    cache: False
    buffer_size: 100
    drop_remainder: True

  test_dataset_config:
    use_tf: True
    data_paths:
      - /tf_asr/manifests/liepa_test.tsv
    tfrecords_dir: /tf_asr/tfrecords/tfrecords-test
    shuffle: False
    cache: False
    buffer_size: 100
    drop_remainder: True

  optimizer_config:
    warmup_steps: 40000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 2
    accumulation_steps: 4
    num_epochs: 20
    outdir: /tf_asr/models
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
    early_stopping:
      monitor: "val_val_rnnt_loss"
      mode: "min"
      patience: 5
      verbose: 1
    checkpoint:
      filepath: /tf_asr/models/checkpoints/epoch-{epoch:02d}-{val_val_rnnt_loss:.4f}.h5
      save_best_only: True
      save_weights_only: False
      save_freq: epoch
      verbose: 1
      monitor: "val_val_rnnt_loss"
      mode: "min"
    states_dir: /tf_asr/models/states
    tensorboard:
      log_dir: /tf_asr/models/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: 'epoch'
      profile_batch: 2

Feb 13 '21 10:02 mjurkus

@mjurkus Could you show the loss curves?

Feb 13 '21 16:02 nglehuy

@mjurkus my training was performed on the LibriSpeech data, with 960h of training data. ASR needs lots of data to converge, so maybe you need more. Furthermore, maybe the LibriSpeech data is cleaner than yours? I also have some proprietary data, but it is way worse than LibriSpeech (not even the same sampling rate). Perhaps you could share the training curves?

Feb 13 '21 16:02 gandroz

Yeah, the amount of data is the answer... That's what I thought.

Here are a couple of loss curves.

Very clean, 16k data, 50hrs: [plot: train_rnnt_loss, val_val_rnnt_loss]

Mixed data (clean and noisy), 16k, 100hrs: [plot: train_rnnt_loss, val_val_rnnt_loss]

It's hard to get good labeled data for my language.

Feb 13 '21 19:02 mjurkus

Your model does not seem to learn anything.... Try to reduce your LR, explore some data augmentation as it could help.
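As a rough guide to what reducing the LR means here: the `optimizer_config` blocks in this thread expose only `warmup_steps` and the Adam parameters, which suggests a warmup schedule of the Transformer ("Noam") kind. That is an assumption and is not confirmed in the thread; the trainer may also scale or cap the rate differently. Under that assumption, the schedule can be sketched as:

```python
def transformer_lr(step: int, d_model: int, warmup_steps: int) -> float:
    """Transformer warmup schedule: linear warmup, then 1/sqrt(step) decay.

    Assumed form, not taken from the TensorFlowASR source.
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    d_model = 144  # encoder_dmodel in both configs of this thread
    for warmup in (10000, 40000):
        peak = transformer_lr(warmup, d_model, warmup)
        print(warmup, peak)
```

With `encoder_dmodel: 144`, the peak under this formula is roughly 8.3e-4 for `warmup_steps: 10000` and roughly 4.2e-4 for `warmup_steps: 40000`, and it is only reached once training has run for `warmup_steps` updates.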

Feb 13 '21 20:02 gandroz

Using the conformer with characters worked way better than using subwords. I managed to get decent results (WER ~15%), but I do not have the graphs for those.

Regarding augmentation, I figured that this config enables it:

    augmentation_config:
      after:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
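To make the `time_masking` parameters concrete, here is a plain-Python sketch of SpecAugment-style time masking. The parameter semantics are assumptions, not taken from the TensorFlowASR source: `mask_factor` is treated as the maximum mask width in frames, and `p_upperbound` as the maximum mask width expressed as a fraction of the utterance length.

```python
import random
from typing import List

Features = List[List[float]]  # frames x feature bins


def time_mask(
    features: Features,
    num_masks: int,
    mask_factor: int,
    p_upperbound: float,
    rng: random.Random,
) -> Features:
    """Zero out up to `num_masks` random spans of consecutive frames."""
    num_frames = len(features)
    masked = [row[:] for row in features]  # leave the input untouched
    # Each mask is at most mask_factor frames and at most a fraction
    # p_upperbound of the utterance (assumed semantics).
    max_width = min(mask_factor, int(num_frames * p_upperbound))
    for _ in range(num_masks):
        width = rng.randint(0, max_width)  # inclusive bounds
        if width == 0:
            continue
        start = rng.randint(0, num_frames - width)
        for t in range(start, start + width):
            masked[t] = [0.0] * len(masked[t])
    return masked


if __name__ == "__main__":
    feats = [[1.0] * 80 for _ in range(200)]  # 200 frames, 80 bins
    out = time_mask(feats, num_masks=10, mask_factor=100,
                    p_upperbound=0.05, rng=random.Random(0))
    print(sum(1 for row in out if not any(row)))  # number of masked frames
```

Under these assumptions, short utterances are limited by `p_upperbound` rather than by `mask_factor`, so the width of each mask scales with the audio length.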

Feb 14 '21 08:02 mjurkus

I've just finished training with ESPnet, using the same configuration except joint_dim=640. The WER is 4.9 on test_clean and 11.9 on test_other. How can I get the results reported in the Conformer paper? @gandroz have you received any reply from the Conformer authors?

Feb 18 '21 09:02 jinggaizi

@jinggaizi What vocabulary size did you use, 1k or 4k or english characters (around 28)?

Feb 18 '21 10:02 nglehuy

1k

Feb 18 '21 11:02 jinggaizi
