WIP: Hugging Face tokenizer and neural LM training pipeline.
Fixes #132

2021-04-23

Use an AM trained with the full LibriSpeech data.
| rescore LM | epoch | num_paths | token ppl | word ppl | test-clean | test-other |
|---|---|---|---|---|---|---|
| baseline, no rescore (Piotr's AM with full LibriSpeech) | * | * | * | * | 4.71 | 9.66 |
| 4-gram LM n-best rescore (Piotr's AM with full LibriSpeech) | * | 100 | * | * | 4.38 | 9.18 |
| 4-gram LM lattice rescore | * | * | * | * | 4.18 | 8.54 |
| transformer LM layers: 16 (model_size: 72M) max_norm=5 | 9 | 100 | 45.02 | 115.24 | 3.61 | 8.29 |
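For context, the n-best rescoring rows above re-rank the `num_paths` first-pass hypotheses with the neural LM. Below is a hypothetical sketch of the score combination (function names and `lm_scale` are illustrative, not snowfall's actual code):

```python
# Hypothetical n-best rescoring sketch: for each utterance, re-rank the
# num_paths first-pass hypotheses by adding a scaled neural-LM score to the
# first-pass (AM + word-level LM) score, and keep the best one.
def rescore_nbest(hyps, nnlm_score, lm_scale=0.5):
    """hyps: list of (word_seq, first_pass_score) pairs for one utterance;
    nnlm_score: callable returning the LM log-probability of a word sequence."""
    best_words, best_total = None, float("-inf")
    for words, first_pass_score in hyps:
        total = first_pass_score + lm_scale * nnlm_score(words)
        if total > best_total:
            best_words, best_total = words, total
    return best_words
```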
2021-04-21
max_norm=5 is better than max_norm=0.25. The training is ongoing.
~~16 layers trained with the Noam optimizer got a better WER than the previous 8-layer transformers.~~
~~But with this reference, max_norm=0.25 in clip_grad_norm_ seems too small, which may explain why epoch 19 obtains only small gains compared to epoch 3.~~
Now max_norm=5 is used, following the ESPnet transformer LM, and results are coming soon.
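For reference, a minimal sketch of where `max_norm` enters the training step via `torch.nn.utils.clip_grad_norm_` (the model/criterion/optimizer names and batch layout are placeholders, not the exact code in this PR):

```python
import torch

# Minimal sketch of a training step with gradient clipping; `model`,
# `criterion`, `optimizer` and the (x, y) batch are placeholders.
def train_step(model, criterion, optimizer, x, y, max_norm=5.0):
    optimizer.zero_grad()
    logits = model(x)                                    # (batch, seq, vocab)
    loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()
    # Clip the global gradient norm; this is the max_norm discussed above
    # (5.0 here vs. the earlier 0.25).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```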
| rescore LM | epoch | num_paths | token ppl | word ppl | test-clean | test-other |
|---|---|---|---|---|---|---|
| baseline no rescore (from fangjun) | * | * | * | * | 6.80 | 18.03 |
| 4-gram LM (from fangjun) | * | 100 | * | * | 6.28 | 16.94 |
| transformer LM layers: 8 (model_size: 42M) | 10 | 100 | 55.04 | 148.07 | 5.66 | 16.09 |
| | 30 | 100 | 53.16 | 141.77 | 5.60 | 16.09 |
| transformer LM layers: 16 (model_size: 72M) | 2 | 100 | 51.86 | 139.35 | 5.51 | 16.00 |
| | 3 | 100 | 51.20 | 135.37 | 5.47 | 15.90 |
| | 19 | 100 | 48.58 | 126.71 | 5.37 | 15.77 |
| transformer LM layers: 16 (model_size: 72M) max_norm=5 | 1 | 100 | 46.94 | 121.41 | 5.39 | 15.73 |
| | 4 | 100 | 45.88 | 118 | 5.27 | 15.73 |
--------- previous comments ---------

This commit is mainly about the Hugging Face tokenizer and a draft transformer/RNN-based LM training pipeline.
They are implemented mainly by referencing the following tutorials: the Hugging Face tokenizer tutorial and a neural LM tutorial that is also referenced by ESPnet.
The current (tokenizer + transformer LM) experiment shows that the PPL can decrease from around 1000 to around 110 within 10 epochs, as shown by the following screenshots.

TODOs:

1. ~~Extend this training pipeline with advanced utils, such as a multi-thread prefetching DataLoader with a proper collate_fn and a TensorBoard summary writer.~~
2. ~~Evaluation/test parts.~~
3. ~~Do experiments with the full LibriSpeech data. Currently only 50 MB of training text is used, out of around 4 GB.~~
4. A proper way to integrate the NNLM into the previous ASR decoding pipeline, i.e. the aim of issue #132.
5. Try other network structures.
These perplexities, are they per word or per token?
> These perplexities, are they per word or per token?
per token.
> the PPL can decrease from around 1000 to around 110 within 10 epochs,
@glynpu Do you know what the normal PPL for the LibriSpeech corpus is, in terms of tokens?
It would very much depend on the way it was tokenized. It's probably better to divide the total log-prob by the number of words, to get the perplexity per word. I'd guess between about 80 and 200, but that's just a guess.
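To make the token vs. word distinction concrete, here is a small sketch of computing both from the same summed log-probability (illustrative only, not code from this PR):

```python
import math

# Convert a summed log-probability over a held-out text into perplexity.
# Per-token and per-word PPL differ only in what the sum is divided by;
# since num_words <= num_tokens when words are split into several tokens,
# the word-level PPL comes out higher.
def perplexity(total_logprob, count):
    # PPL = exp(-(1/N) * sum log p)
    return math.exp(-total_logprob / count)

# token_ppl = perplexity(total_logprob, num_tokens)
# word_ppl  = perplexity(total_logprob, num_words)
```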
In our original paper we mention perplexities of 150 and 170.
As shown by the RNN-LM experiment in Kaldi with LibriSpeech data:
`# rnnlm/train_rnnlm.sh: train/dev perplexity was 109.2 / 110.7.`
I am studying its configuration and hope to get a comparable PPL with the same data this week.
Yes, probably that modulo method from Kaldi is fine. shuf is not always installed.
LIyong.Guo commented on egs/librispeech/asr/nnlm/run.sh (https://github.com/k2-fsa/snowfall/pull/139#discussion_r602971474):
```sh
    --test-file=$full_text \
    --tokenizer-path=$tokenizer
fi

if [ $stage -eq 4 ]; then
  echo "split all data into train/valid/test"
  full_tokens=${full_text}.tokens
  valid_test_fraction=10 # currently 5 percent for valid and 5 percent for test
  valid_test_tokens=$lm_train/valid_test.tokens
  train_tokens=$lm_train/train.tokens
  num_utts_total=$(wc -l <$full_tokens )
  num_valid_test=$(($num_utts_total/${valid_test_fraction}))
  set +x
  shuf -n $num_valid_test $full_tokens > $valid_test_tokens
```
Reproducibility is important. Maybe the data separation method of Kaldi RNNLM (https://github.com/kaldi-asr/kaldi/blob/pybind11/egs/librispeech/s5/local/rnnlm/tuning/run_tdnn_lstm_1a.sh#L75) can be used in the following experiments:

```sh
gunzip -c $text | cut -d ' ' -f2- | \
  awk -v text_dir=$text_dir '{if(NR%2000 == 0) { print >text_dir"/dev.txt"; } else {print;}}' \
  >$text_dir/librispeech.txt
```
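If it is ever more convenient to do this split in Python than in awk, a hedged sketch of the same NR%2000 idea (a hypothetical helper, not part of this PR):

```python
# Deterministic modulo split: every n-th line goes to dev, the rest to
# train, so the result is reproducible without `shuf`.
def split_modulo(in_path, train_path, dev_path, n=2000):
    with open(in_path) as fin, \
            open(train_path, "w") as ftrain, \
            open(dev_path, "w") as fdev:
        for i, line in enumerate(fin, start=1):
            (fdev if i % n == 0 else ftrain).write(line)
```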
Good work! I will try to read and understand what you are doing.
LIyong.Guo commented on egs/librispeech/asr/nnlm/main.py (https://github.com/k2-fsa/snowfall/pull/139#discussion_r603797150):
```python
###############################################################################
# Load data
###############################################################################
corpus = data.Corpus(args.data)
# Starting from sequential data, batchify arranges the dataset into columns.
# For instance, with the alphabet as the sequence and batch size 4, we'd get
# ┌ a g m s ┐
# │ b h n t │
# │ c i o u │
# │ d j p v │
# │ e k q w │
# └ f l r x ┘.
# These columns are treated as independent by the model, which means that the
# dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient
```
No, the training text is not treated as one long sequence. I have modified the data preparation method so that each piece of text is treated independently. Sorry, I forgot to delete these unrelated original comments. By the way, I am refactoring the training pipeline according to these reviews. Temporarily, a new dataset class is located here: https://github.com/glynpu/snowfall/blob/88e0d49d559860134bfdf244b38bf25c84fa2c56/egs/librispeech/asr/nnlm/local/dataset.py#L51, which handles the training text line by line and then batchifies the pieces independently in CollateFunc (https://github.com/glynpu/snowfall/blob/88e0d49d559860134bfdf244b38bf25c84fa2c56/egs/librispeech/asr/nnlm/local/dataset.py#L15).
```python
with open(text_file, 'r') as f:
    # a line represents a piece of text, e.g.
    # DELAWARE IS NOT AFRAID OF DOGS
    for line in f:
        text = line.strip().split()
        assert len(text) > 0
        text_id = self.text2id(text)
        # token_id format:
        # <bos_id> token_id token_id token_id *** <eos_id>
        token_id = self.text_id2token_id(text_id)
        self.data.append(token_id)
```
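For readers skimming the thread, here is a minimal sketch of what a padding collate function of this kind typically does (the names, pad_id, and input/target layout are assumptions, not the exact CollateFunc in the linked file):

```python
import torch

# Pad the variable-length token-id sequences of one batch to the longest
# sequence, then shift by one position to form LM inputs and targets.
def collate_fn(batch, pad_id=0):
    max_len = max(len(ids) for ids in batch)
    padded = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    for i, ids in enumerate(batch):
        padded[i, :len(ids)] = torch.tensor(ids, dtype=torch.long)
    return padded[:, :-1], padded[:, 1:]  # inputs, targets
```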
Something is not installed...
```
./run.sh 2&
[1] 73056
de-74279-k2-dev-2-0331181900-7b69767657-72fhf:nnlm: training tokenizer
Traceback (most recent call last):
  File "local/huggingface_tokenizer.py", line 12, in <module>
    from tokenizers import Tokenizer
ModuleNotFoundError: No module named 'tokenizers'
```
I don't know how easy it is to set things up so that they get installed automatically, or at least so that the user is told what to install?
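One lightweight option for the "tell the user what to install" part could be an explicit import check (a sketch, not what local/huggingface_tokenizer.py currently does):

```python
# Fail early with an actionable message if the optional dependency is missing.
try:
    from tokenizers import Tokenizer
except ImportError as e:
    raise ImportError(
        "This script needs the 'tokenizers' package; "
        "install it with: pip install tokenizers"
    ) from e
```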
A commit to handle this together with other known bugs will be submitted this afternoon.
@danpovey I added a statement to run.sh that automatically installs the extra dependencies:
```sh
if [ $stage -eq -1 ]; then
  # env for experiment ../simple_v1 is expected to have been built.
  echo "Install extra dependencies"
  pip install -r requirements.txt
fi
```
Now I am still facing some convergence issues. After several epochs, the PPL is stuck around 1000. I am not sure whether there are some critical unknown bugs or whether it is just an inappropriate hyper-parameter configuration.