WIP: Hugging Face tokenizer and neural LM training pipeline.
Fixes #132

2021-04-23

Use an AM trained with the full LibriSpeech data.
| rescore LM | epoch | num_paths | token ppl | word ppl | test-clean | test-other |
|---|---|---|---|---|---|---|
| baseline, no rescore (Piotr's AM with full LibriSpeech) | * | * | * | * | 4.71 | 9.66 |
| 4-gram LM n-best rescore (Piotr's AM with full LibriSpeech) | * | 100 | * | * | 4.38 | 9.18 |
| 4-gram LM lattice rescore | * | * | * | * | 4.18 | 8.54 |
| transformer LM layers: 16 (model_size: 72M) max_norm=5 | 9 | 100 | 45.02 | 115.24 | 3.61 | 8.29 |
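For context, the n-best rescoring rows above re-rank the `num_paths` first-pass hypotheses with the neural LM. Below is a hypothetical sketch of the score combination (function names and `lm_scale` are illustrative, not snowfall's actual code):

```python
# Hypothetical n-best rescoring sketch: for each utterance, re-rank the
# num_paths first-pass hypotheses by adding a scaled neural-LM score to the
# first-pass (AM + word-level LM) score, and keep the best one.
def rescore_nbest(hyps, nnlm_score, lm_scale=0.5):
    """hyps: list of (word_seq, first_pass_score) pairs for one utterance;
    nnlm_score: callable returning the LM log-probability of a word sequence."""
    best_words, best_total = None, float("-inf")
    for words, first_pass_score in hyps:
        total = first_pass_score + lm_scale * nnlm_score(words)
        if total > best_total:
            best_words, best_total = words, total
    return best_words
```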
2021-04-21
max_norm=5 is better than max_norm=0.25. The training is ongoing.
~~16 layers trained with the Noam optimizer got a better WER than the previous 8-layer transformers.~~
~~But with this reference, max_norm=0.25 in clip_grad_norm_ seems too small, which may explain why epoch 19 obtains only small gains compared to epoch 3.~~
Now max_norm=5 is used, following the ESPnet transformer LM, and results are coming soon.
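For reference, a minimal sketch of where `max_norm` enters the training step via `torch.nn.utils.clip_grad_norm_` (the model/criterion/optimizer names and batch layout are placeholders, not the exact code in this PR):

```python
import torch

# Minimal sketch of a training step with gradient clipping; `model`,
# `criterion`, `optimizer` and the (x, y) batch are placeholders.
def train_step(model, criterion, optimizer, x, y, max_norm=5.0):
    optimizer.zero_grad()
    logits = model(x)                                    # (batch, seq, vocab)
    loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()
    # Clip the global gradient norm; this is the max_norm discussed above
    # (5.0 here vs. the earlier 0.25).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```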
| rescore LM | epoch | num_paths | token ppl | word ppl | test-clean | test-other |
|---|---|---|---|---|---|---|
| baseline no rescore (from fangjun) | * | * | * | * | 6.80 | 18.03 |
| 4-gram LM (from fangjun) | * | 100 | * | * | 6.28 | 16.94 |
| transformer LM layers: 8 (model_size: 42M) | 10 | 100 | 55.04 | 148.07 | 5.66 | 16.09 |
| | 30 | 100 | 53.16 | 141.77 | 5.60 | 16.09 |
| transformer LM layers: 16 (model_size: 72M) | 2 | 100 | 51.86 | 139.35 | 5.51 | 16.00 |
| | 3 | 100 | 51.20 | 135.37 | 5.47 | 15.90 |
| | 19 | 100 | 48.58 | 126.71 | 5.37 | 15.77 |
| transformer LM layers: 16 (model_size: 72M) max_norm=5 | 1 | 100 | 46.94 | 121.41 | 5.39 | 15.73 |
| | 4 | 100 | 45.88 | 118 | 5.27 | 15.73 |
--------- previous comments ---------

This commit is mainly about the Hugging Face tokenizer and a draft transformer/RNN-based LM training pipeline.
They are implemented mainly by referencing the following tutorials: the Hugging Face tokenizer tutorial and a neural LM tutorial that is also referenced by ESPnet.
The current (tokenizer + transformer LM) experiment shows that the PPL can decrease from around 1000 to around 110 within 10 epochs, as shown by the following screenshots.

TODOs:

1. ~~Extend this training pipeline with advanced utils, such as a multi-thread prefetching DataLoader with a proper collate_fn and a TensorBoard summary writer.~~
2. ~~Evaluation/test parts.~~
3. ~~Do experiments with the full LibriSpeech data. Currently only 50 MB of training text is used, out of around 4 GB.~~
4. A proper way to integrate the NNLM into the previous ASR decoding pipeline, i.e. the aim of issue #132.
5. Try other network structures.
These perplexities, are they per word or per token?
> These perplexities, are they per word or per token?
per token.
> the PPL can decrease from around 1000 to around 110 within 10 epochs,
@glynpu Do you know what the normal PPL for the LibriSpeech corpus is, in terms of tokens?
It would very much depend on the way it was tokenized. It's probably better to divide the total log-prob by the number of words, to get the perplexity per word. I'd guess between about 80 and 200, but that's just a guess.
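To make the token vs. word distinction concrete, here is a small sketch of computing both from the same summed log-probability (illustrative only, not code from this PR):

```python
import math

# Convert a summed log-probability over a held-out text into perplexity.
# Per-token and per-word PPL differ only in what the sum is divided by;
# since num_words <= num_tokens when words are split into several tokens,
# the word-level PPL comes out higher.
def perplexity(total_logprob, count):
    # PPL = exp(-(1/N) * sum log p)
    return math.exp(-total_logprob / count)

# token_ppl = perplexity(total_logprob, num_tokens)
# word_ppl  = perplexity(total_logprob, num_words)
```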
In our original paper we mention perplexities of 150 and 170.
As shown by the RNN-LM experiment in Kaldi with LibriSpeech data:
`# rnnlm/train_rnnlm.sh: train/dev perplexity was 109.2 / 110.7.`
I am studying its configuration and hope to get a comparable PPL with the same data this week.
Yes, probably that modulo method from Kaldi is fine. shuf is not always installed.
LIyong.Guo commented on egs/librispeech/asr/nnlm/run.sh (https://github.com/k2-fsa/snowfall/pull/139#discussion_r602971474):
```sh
    --test-file=$full_text \
    --tokenizer-path=$tokenizer
fi

if [ $stage -eq 4 ]; then
  echo "split all data into train/valid/test"
  full_tokens=${full_text}.tokens
  valid_test_fraction=10 # currently 5 percent for valid and 5 percent for test
  valid_test_tokens=$lm_train/valid_test.tokens
  train_tokens=$lm_train/train.tokens
  num_utts_total=$(wc -l <$full_tokens )
  num_valid_test=$(($num_utts_total/${valid_test_fraction}))
  set +x
  shuf -n $num_valid_test $full_tokens > $valid_test_tokens
```
Reproducibility is important. Maybe the data separation method of Kaldi RNNLM (https://github.com/kaldi-asr/kaldi/blob/pybind11/egs/librispeech/s5/local/rnnlm/tuning/run_tdnn_lstm_1a.sh#L75) can be used in the following experiments:

```sh
gunzip -c $text | cut -d ' ' -f2- | \
  awk -v text_dir=$text_dir '{if(NR%2000 == 0) { print >text_dir"/dev.txt"; } else {print;}}' \
  >$text_dir/librispeech.txt
```
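If it is ever more convenient to do this split in Python than in awk, a hedged sketch of the same NR%2000 idea (a hypothetical helper, not part of this PR):

```python
# Deterministic modulo split: every n-th line goes to dev, the rest to
# train, so the result is reproducible without `shuf`.
def split_modulo(in_path, train_path, dev_path, n=2000):
    with open(in_path) as fin, \
            open(train_path, "w") as ftrain, \
            open(dev_path, "w") as fdev:
        for i, line in enumerate(fin, start=1):
            (fdev if i % n == 0 else ftrain).write(line)
```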
Good work! I will try to read and understand what you are doing.
LIyong.Guo commented on egs/librispeech/asr/nnlm/main.py (https://github.com/k2-fsa/snowfall/pull/139#discussion_r603797150):
```python
###############################################################################
# Load data
###############################################################################
corpus = data.Corpus(args.data)
# Starting from sequential data, batchify arranges the dataset into columns.
# For instance, with the alphabet as the sequence and batch size 4, we'd get
# ┌ a g m s ┐
# │ b h n t │
# │ c i o u │
# │ d j p v │
# │ e k q w │
# └ f l r x ┘.
# These columns are treated as independent by the model, which means that the
# dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient
```
No, the training text is not treated as one long sequence. I have modified the data preparation method so that each piece of text is treated independently. Sorry, I forgot to delete these unrelated original comments. By the way, I am refactoring the training pipeline according to these reviews. Temporarily, a new dataset class is located here: https://github.com/glynpu/snowfall/blob/88e0d49d559860134bfdf244b38bf25c84fa2c56/egs/librispeech/asr/nnlm/local/dataset.py#L51, which handles the training text line by line and then batchifies the pieces independently in CollateFunc (https://github.com/glynpu/snowfall/blob/88e0d49d559860134bfdf244b38bf25c84fa2c56/egs/librispeech/asr/nnlm/local/dataset.py#L15).
```python
with open(text_file, 'r') as f:
    # a line represents a piece of text, e.g.
    # DELAWARE IS NOT AFRAID OF DOGS
    for line in f:
        text = line.strip().split()
        assert len(text) > 0
        text_id = self.text2id(text)
        # token_id format:
        # <bos_id> token_id token_id token_id *** <eos_id>
        token_id = self.text_id2token_id(text_id)
        self.data.append(token_id)
```
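For readers skimming the thread, here is a minimal sketch of what a padding collate function of this kind typically does (the names, pad_id, and input/target layout are assumptions, not the exact CollateFunc in the linked file):

```python
import torch

# Pad the variable-length token-id sequences of one batch to the longest
# sequence, then shift by one position to form LM inputs and targets.
def collate_fn(batch, pad_id=0):
    max_len = max(len(ids) for ids in batch)
    padded = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    for i, ids in enumerate(batch):
        padded[i, :len(ids)] = torch.tensor(ids, dtype=torch.long)
    return padded[:, :-1], padded[:, 1:]  # inputs, targets
```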
Something is not installed...
```
./run.sh 2&
[1] 73056
de-74279-k2-dev-2-0331181900-7b69767657-72fhf:nnlm: training tokenizer
Traceback (most recent call last):
  File "local/huggingface_tokenizer.py", line 12, in <module>
    from tokenizers import Tokenizer
ModuleNotFoundError: No module named 'tokenizers'
```
I don't know how easy it is to set things up so that they get installed automatically, or at least so that the user is told what to install?
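One lightweight option for the "tell the user what to install" part could be an explicit import check (a sketch, not what local/huggingface_tokenizer.py currently does):

```python
# Fail early with an actionable message if the optional dependency is missing.
try:
    from tokenizers import Tokenizer
except ImportError as e:
    raise ImportError(
        "This script needs the 'tokenizers' package; "
        "install it with: pip install tokenizers"
    ) from e
```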
A commit to handle this together with other known bugs will be submitted this afternoon.
@danpovey I added a statement to run.sh that automatically installs the extra dependencies:
```sh
if [ $stage -eq -1 ]; then
  # env for experiment ../simple_v1 is expected to have been built.
  echo "Install extra dependencies"
  pip install -r requirements.txt
fi
```
Now I am still facing some convergence issues. After several epochs, the PPL is stuck around 1000. I am not sure whether there are some critical unknown bugs or whether it is just an inappropriate hyper-parameter configuration.