
WIP: huggingface tokenizer and Neural LM training pipeline.

Open glynpu opened this issue 4 years ago • 11 comments

Fixes #132. 2021-04-23: use an AM trained with the full LibriSpeech data.

| rescore LM | epoch | num_paths | token ppl | word ppl | test-clean | test-other |
| --- | --- | --- | --- | --- | --- | --- |
| baseline: no rescore (Piotr's AM with full LibriSpeech) | * | * | * | * | 4.71 | 9.66 |
| 4-gram LM n-best rescore (Piotr's AM with full LibriSpeech) | * | 100 | * | * | 4.38 | 9.18 |
| 4-gram LM lattice rescore | * | * | * | * | 4.18 | 8.54 |
| transformer LM, 16 layers (model size: 72M), max_norm=5 | 9 | 100 | 45.02 | 115.24 | 3.61 | 8.29 |

2021-04-21: max_norm=5 is better than max_norm=0.25. The training is ongoing. ~16 layers trained with the Noam optimizer got a better WER than the previous 8-layer transformers.~ ~But with this reference, max_norm=0.25 in clip_grad_norm_ seems TOO SMALL, which may explain why epoch 19 obtains only a small gain compared to epoch 3.~ Now max_norm=5 is used, following the ESPnet transformer LM, and results are coming soon.
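For context, the max_norm values above refer to the threshold passed to PyTorch's gradient clipping. Below is a minimal sketch of how such clipping typically sits in a training step; the model, batch layout, optimizer, and criterion names are illustrative, not the actual snowfall training code:

```python
import torch

# Illustrative training step with global gradient-norm clipping.
# max_norm=5.0 corresponds to the ESPnet-style setting discussed above;
# max_norm=0.25 was the earlier, apparently too-small value.
def train_step(model, batch, optimizer, criterion, max_norm=5.0):
    optimizer.zero_grad()
    logits = model(batch["input"])                      # (B, T, vocab)
    loss = criterion(logits.view(-1, logits.size(-1)),  # flatten for CE loss
                     batch["target"].view(-1))
    loss.backward()
    # Rescale gradients so their global L2 norm does not exceed max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return loss.item()
```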

| rescore LM | epoch | num_paths | token ppl | word ppl | test-clean | test-other |
| --- | --- | --- | --- | --- | --- | --- |
| baseline: no rescore (from fangjun) | * | * | * | * | 6.80 | 18.03 |
| 4-gram LM (from fangjun) | * | 100 | * | * | 6.28 | 16.94 |
| transformer LM, 8 layers (model size: 42M) | 10 | 100 | 55.04 | 148.07 | 5.66 | 16.09 |
| | 30 | 100 | 53.16 | 141.77 | 5.60 | 16.09 |
| transformer LM, 16 layers (model size: 72M) | 2 | 100 | 51.86 | 139.35 | 5.51 | 16.00 |
| | 3 | 100 | 51.20 | 135.37 | 5.47 | 15.90 |
| | 19 | 100 | 48.58 | 126.71 | 5.37 | 15.77 |
| transformer LM, 16 layers (model size: 72M), max_norm=5 | 1 | 100 | 46.94 | 121.41 | 5.39 | 15.73 |
| | 4 | 100 | 45.88 | 118 | 5.27 | 15.73 |

--------- previous comments ---------

This commit is mainly about the huggingface tokenizer and a draft transformer/RNN-based LM training pipeline.

They are implemented mainly by referencing the following tutorials: the tokenizer tutorial and the neural LM tutorial that is also referenced by ESPnet.

The current (tokenizer + transformer LM) experiment shows that the PPL decreases from around 1000 to around 110 within 10 epochs, as shown by the following screenshots.

[screenshots: training PPL decreasing over epochs]
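For reference, training a BPE tokenizer with the huggingface tokenizers library roughly follows the pattern below. This is only a sketch based on the referenced tutorial; the file paths, vocab size, and special tokens are placeholders, not the exact settings used in this PR.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build and train a BPE tokenizer on the LM training text.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=5000,  # placeholder value
                     special_tokens=["[UNK]", "<bos>", "<eos>"])
tokenizer.train(files=["data/lm_train/train_text.txt"], trainer=trainer)
tokenizer.save("exp/tokenizer.json")

# Encoding a line of text yields the token ids fed to the neural LM.
ids = tokenizer.encode("DELAWARE IS NOT AFRAID OF DOGS").ids
```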

TODOs:

1. ~Extend this training pipeline with advanced utils, such as a multi-thread prefetching DataLoader with a proper collate_fn, and a TensorBoard summary writer.~
2. ~Evaluation/test parts.~
3. ~Do experiments with the full LibriSpeech data. Currently only 50 MB of training text is used out of around 4 GB.~
4. A proper way to integrate the NNLM into the previous ASR decoding pipeline, i.e. the aim of issue #132.
5. Try other network structures.

glynpu avatar Mar 25 '21 11:03 glynpu

These perplexities, are they per word or per token?

danpovey avatar Mar 28 '21 15:03 danpovey

> These perplexities, are they per word or per token?

per token.

glynpu avatar Mar 28 '21 15:03 glynpu

> the PPL can decrease from around 1000 to around 110 within 10 epochs

@glynpu Do you know what is the normal PPL for the LibriSpeech corpus in terms of tokens?

csukuangfj avatar Mar 28 '21 15:03 csukuangfj

It would very much depend on the way it was tokenized. It's probably better to divide the total log-prob by the number of words, to get the perplexity per word. I'd guess between about 80 and 200, but that's just a guess.


danpovey avatar Mar 28 '21 15:03 danpovey
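To make the per-token vs. per-word distinction concrete: both perplexities come from the same total log-probability, only the normalizer changes. A small illustrative calculation (the counts below are made up, not LibriSpeech statistics):

```python
import math

# Total negative log-probability of a held-out set, together with its
# token and word counts; the numbers are made up for illustration only.
total_neg_logprob = 350000.0
num_tokens = 90000
num_words = 65000

token_ppl = math.exp(total_neg_logprob / num_tokens)  # per-token perplexity
word_ppl = math.exp(total_neg_logprob / num_words)    # per-word perplexity

# Equivalently, word PPL is token PPL raised to the tokens-per-word ratio.
assert math.isclose(word_ppl, token_ppl ** (num_tokens / num_words))
```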

In our original paper we mention perplexities of 150 and 170.


danpovey avatar Mar 28 '21 15:03 danpovey

As shown by the RNN-LM experiment in Kaldi with LibriSpeech data,

# rnnlm/train_rnnlm.sh: train/dev perplexity was 109.2 / 110.7.

I am studying its configuration and hope to get a comparable ppl with the same data this week.

glynpu avatar Mar 28 '21 16:03 glynpu

Yes, probably that modulo method from Kaldi is fine. shuf is not always installed.

On Mon, Mar 29, 2021 at 9:47 AM LIyong.Guo commented on this pull request, in egs/librispeech/asr/nnlm/run.sh (https://github.com/k2-fsa/snowfall/pull/139#discussion_r602971474), about this part of the script:

```
    --test-file=$full_text \
    --tokenizer-path=$tokenizer
fi

if [ $stage -eq 4 ]; then
  echo "split all data into train/valid/test"
  full_tokens=${full_text}.tokens
  valid_test_fraction=10 # currently 5 percent for valid and 5 percent for test
  valid_test_tokens=$lm_train/valid_test.tokens
  train_tokens=$lm_train/train.tokens
  num_utts_total=$(wc -l <$full_tokens)
  num_valid_test=$(($num_utts_total/${valid_test_fraction}))
  set +x
  shuf -n $num_valid_test $full_tokens > $valid_test_tokens
```

> Reproducibility is important. Maybe the data separation method of Kaldi RNNLM (https://github.com/kaldi-asr/kaldi/blob/pybind11/egs/librispeech/s5/local/rnnlm/tuning/run_tdnn_lstm_1a.sh#L75) can be used in the following experiments:
>
> ```
> gunzip -c $text | cut -d ' ' -f2- | \
>   awk -v text_dir=$text_dir '{if(NR%2000 == 0) { print >text_dir"/dev.txt"; } else {print;}}' >$text_dir/librispeech.txt
> ```

danpovey avatar Mar 29 '21 02:03 danpovey
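As a sketch of the reproducible modulo split discussed in the review above (every 2000th line to dev, the rest to train, no shuf needed), a Python equivalent could look like the following; the function and file names are hypothetical, not part of this PR:

```python
# Deterministic train/dev split in the spirit of the Kaldi rnnlm recipe
# quoted above: every `modulo`-th line goes to dev, all others to train.
# Being deterministic, it is reproducible and does not depend on `shuf`.
def split_train_dev(full_tokens="data/lm_train/full.tokens",  # hypothetical paths
                    train_out="data/lm_train/train.tokens",
                    dev_out="data/lm_train/dev.tokens",
                    modulo=2000):
    with open(full_tokens) as fin, \
         open(train_out, "w") as ftrain, \
         open(dev_out, "w") as fdev:
        for line_no, line in enumerate(fin, start=1):
            (fdev if line_no % modulo == 0 else ftrain).write(line)

if __name__ == "__main__":
    split_train_dev()
```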

Good work! I will try to read and understand what you are doing.

On Tue, Mar 30, 2021 at 1:45 PM LIyong.Guo commented on this pull request, in egs/librispeech/asr/nnlm/main.py (https://github.com/k2-fsa/snowfall/pull/139#discussion_r603797150), about this block of comments:

```
###############################################################################
# Load data
###############################################################################

corpus = data.Corpus(args.data)

# Starting from sequential data, batchify arranges the dataset into columns.
# For instance, with the alphabet as the sequence and batch size 4, we'd get
# ┌ a g m s ┐
# │ b h n t │
# │ c i o u │
# │ d j p v │
# │ e k q w │
# └ f l r x ┘.
# These columns are treated as independent by the model, which means that the
# dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient
```

> No, the training text is not treated as one long sequence. I have modified the data preparation method so that each piece of text is treated independently. Sorry, I forgot to delete these unrelated original comments. By the way, I am refactoring the training pipeline according to these reviews. Temporarily, a new dataset class is located at https://github.com/glynpu/snowfall/blob/88e0d49d559860134bfdf244b38bf25c84fa2c56/egs/librispeech/asr/nnlm/local/dataset.py#L51, which handles the training texts one by one and then batchifies them independently in CollateFunc (https://github.com/glynpu/snowfall/blob/88e0d49d559860134bfdf244b38bf25c84fa2c56/egs/librispeech/asr/nnlm/local/dataset.py#L15):
>
> ```
>     with open(text_file, 'r') as f:
>         # a line represents a piece of text, e.g.
>         # DELAWARE IS NOT AFRAID OF DOGS
>         for line in f:
>             text = line.strip().split()
>             assert len(text) > 0
>             text_id = self.text2id(text)
>             # token_id format:
>             # <bos_id> token_id token_id token_id *** <eos_id>
>             token_id = self.text_id2token_id(text_id)
>             self.data.append(token_id)
> ```

danpovey avatar Mar 30 '21 05:03 danpovey
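As a rough illustration of the per-utterance handling described above (one line of text per example, padded together in a collate function), a simplified sketch could look like this; it is not the actual local/dataset.py, and the padding id and toy token ids are arbitrary placeholders:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

class LMTextDataset(Dataset):
    """Each element is one utterance: <bos_id> token_id ... token_id <eos_id>."""

    def __init__(self, token_ids_per_line):
        # token_ids_per_line: list of lists of ints, one list per line of text
        self.data = [torch.tensor(ids, dtype=torch.long)
                     for ids in token_ids_per_line]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

def collate_fn(batch, pad_id=0):
    # Pad variable-length utterances to the longest one in the batch, then
    # shift by one position so the model predicts the next token everywhere.
    padded = pad_sequence(batch, batch_first=True, padding_value=pad_id)
    return padded[:, :-1], padded[:, 1:]

# Toy usage with two utterances (ids are arbitrary placeholders).
loader = DataLoader(LMTextDataset([[1, 5, 7, 2], [1, 9, 2]]),
                    batch_size=2, collate_fn=collate_fn)
```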

Something is not installed...

```
 ./run.sh 2&
[1] 73056
de-74279-k2-dev-2-0331181900-7b69767657-72fhf:nnlm: training tokenizer
Traceback (most recent call last):
  File "local/huggingface_tokenizer.py", line 12, in <module>
    from tokenizers import Tokenizer
ModuleNotFoundError: No module named 'tokenizers'
```

I don't know how easy it is to set it up so that things get installed automatically, or at least so that the user is told what to install.

danpovey avatar Apr 01 '21 04:04 danpovey


A commit to handle this together with other known bugs will be submitted this afternoon.

glynpu avatar Apr 01 '21 04:04 glynpu


@danpovey I added a statement to run.sh to install the extra dependencies automatically:

```
if [ $stage -eq -1 ]; then
  # env for experiment ../simple_v1 is expected to have been built.
  echo "Install extra dependencies"
  pip install -r requirements.txt
fi
```

Now I am still facing some convergence issues: after several epochs, the PPL is stuck around 1000. I am not sure whether there are some critical unknown bugs or just an inappropriate hyper-parameter configuration.

glynpu avatar Apr 01 '21 12:04 glynpu