Low hanging fruit: neural language model

Open danpovey opened this issue 4 years ago • 6 comments

Guys,

I realized that there is some very low-hanging fruit that could easily make our WERs state of the art: neural LM rescoring. An advantage of our framework -- possibly the key advantage -- is that the decoding part is very easy, so we can easily rescore large N-best lists with neural LMs. In addition, it's quite easy to manipulate variable-length sequences, so things like training and using LMs should be a little easier than they otherwise would be.

Here's what I propose: as a relatively easy baseline that can be extended later, we can train a word-piece neural LM (I recommend word-pieces because the vocab size could otherwise be quite large, making the embedding matrices difficult to train). So we'll need: (i) some mechanism to split words into word-pieces; (ii) data preparation for the LM training, which in the Librispeech case would, I assume, include the additional text training data that Librispeech comes with; (iii) a script to train the actual LM. I assume this would be quite similar to our conformer self-attention model, with a cross-entropy (xent) output (no forward-backward needed), except we'd use a different type of masking, i.e. a mask of shape (B, T, T), because we need to limit attention to left-context only.
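As a rough sketch of that masking idea (illustrative PyTorch only, not the snowfall conformer code; the class name, layer sizes, and the use of nn.TransformerEncoder are assumptions):

```python
import torch
import torch.nn as nn

class WordPieceLM(nn.Module):
    """Hypothetical word-piece transformer LM with a left-context (causal) mask."""

    def __init__(self, vocab_size: int, d_model: int = 256, nhead: int = 4,
                 num_layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=1024)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) word-piece ids; lengths: (B,) true sequence lengths.
        B, T = tokens.shape
        # Causal mask: position t may only attend to positions <= t.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=tokens.device), diagonal=1)
        # A per-sequence padding mask handles variable lengths; combined with the
        # causal mask this gives the effective (B, T, T) masking described above.
        padding = torch.arange(T, device=tokens.device)[None, :] >= lengths[:, None]
        x = self.embed(tokens).transpose(0, 1)              # (T, B, d_model)
        h = self.encoder(x, mask=causal, src_key_padding_mask=padding)
        return self.out(h).transpose(0, 1)                  # (B, T, vocab_size)

# Training would use ordinary cross-entropy on next-token targets, e.g.
#   loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
#                          tokens[:, 1:].reshape(-1), ignore_index=pad_id)
```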

For decoding with the LM, we'd first do a decoding pass with our n-gram LM, get word sequences using our randomized N-best approach, get their scores from the neural LM, and then compute combined scores with the n-gram LM and neural LM scores interpolated 50-50, or something like that. [Note: converting the word sequences into word-piece sequences is very easy; we can just do it by indexing a ragged tensor and then removing an axis.]
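A hedged sketch of what that rescoring step could look like, assuming an N-best list where each hypothesis already carries its n-gram (lattice) score and has been mapped to word-piece ids; the function names and the 50-50 weight are illustrative, not snowfall APIs:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def nnlm_logprob(model, token_ids: torch.Tensor) -> float:
    """Total log-probability of one word-piece sequence of shape (1, T)."""
    lengths = torch.tensor([token_ids.size(1)], device=token_ids.device)
    logits = model(token_ids, lengths)                        # (1, T, V)
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    targets = token_ids[:, 1:]
    return logp.gather(2, targets.unsqueeze(-1)).sum().item()

def rescore_nbest(nbest, model, lam: float = 0.5):
    """nbest: list of (word_piece_ids, ngram_score) pairs for one utterance.
    Returns the hypothesis with the best interpolated score."""
    best, best_score = None, float("-inf")
    for ids, ngram_score in nbest:
        score = (1.0 - lam) * ngram_score + lam * nnlm_logprob(model, ids)
        if score > best_score:
            best, best_score = ids, score
    return best
```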

Dan

danpovey avatar Mar 19 '21 07:03 danpovey

This is a pretty feature-rich and efficient implementation of sub-word tokenizers (with training methods too) https://github.com/huggingface/tokenizers

pzelasko avatar Mar 19 '21 13:03 pzelasko

Cool, thanks for the info!

danpovey avatar Mar 19 '21 16:03 danpovey

I am doing this task.

(i) some mechanism to split up words into word-pieces,

A tokenizer has now been trained using https://github.com/huggingface/tokenizers, as suggested by @pzelasko, on the LibriSpeech train_960_text (i.e. the text from train_clean_100, train_clean_360 and train_other_500; librispeech-lm-norm.txt is not used yet). A demo is shown below.

[Screenshot: demo of word-piece tokenizer output]

As shown in the screenshot above, "studying" is tokenized into the sequence ('st', '##ud', '##ying'). @danpovey, what do you think of this method?

Next: I am going to train a tokenizer with the full LibriSpeech text, i.e. train_960_text (48 MB) plus librispeech-lm-norm.txt (4 GB).
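(For reference, a minimal sketch of how such a word-piece tokenizer could be trained with huggingface/tokenizers; the file path, vocabulary size, and special tokens below are assumptions, not necessarily the setup used above.)

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Hypothetical path to the LibriSpeech training text; the vocab size is a guess.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(vocab_size=5000, special_tokens=["[UNK]"])
tokenizer.train(files=["data/local/lm/train_960_text"], trainer=trainer)

# The exact pieces depend on the training text and vocab size, e.g.
# something like ['st', '##ud', '##ying'] for "studying".
print(tokenizer.encode("studying").tokens)
tokenizer.save("data/lang/wordpiece_tokenizer.json")
```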

What kind of neural network should we try first for the LM once the data preparation is done? @danpovey I found a reference implementation in ESPnet, which is an RNNLM, but I am not sure whether it is appropriate for this task.

glynpu avatar Mar 23 '21 13:03 glynpu

Looks cool! My two cents: it's probably worth starting with an RNNLM and eventually trying some autoregressive transformers like GPT-2 (small/medium size).
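(For prototyping, an RNNLM can be a very small amount of code; the sketch below is illustrative PyTorch, with sizes and names that are assumptions rather than an ESPnet or snowfall recipe.)

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal LSTM language model over word-piece ids."""

    def __init__(self, vocab_size: int, embed_dim: int = 256,
                 hidden_dim: int = 512, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens: torch.Tensor, hidden=None):
        # tokens: (B, T) word-piece ids -> logits of shape (B, T, vocab_size).
        x = self.embed(tokens)
        h, hidden = self.rnn(x, hidden)
        return self.out(h), hidden
```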

pzelasko avatar Mar 23 '21 14:03 pzelasko

We'll probably be evaluating this in batch mode, not word by word, so some kind of transformer would probably be good from an efficiency point of view; but for prototyping, anything is OK with me. I suppose my main concern is to keep the code relatively simple, as compatible/similar as possible with our AM training code, and not to have too many additional dependencies. But anything is fine as long as you keep making some kind of progress, as it will all increase your familiarity with the issues.
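(To illustrate the batch-mode point, here is a sketch of scoring a whole N-best list in a single forward pass, assuming the transformer-LM interface sketched earlier in the thread; all names are illustrative.)

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

@torch.no_grad()
def batched_nnlm_logprobs(model, seqs):
    """seqs: list of 1-D LongTensors of word-piece ids; returns (N,) log-probs."""
    tokens = pad_sequence(seqs, batch_first=True)                      # (N, T)
    lengths = torch.tensor([len(s) for s in seqs], device=tokens.device)
    logits = model(tokens, lengths)                                    # (N, T, V)
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    targets = tokens[:, 1:]
    per_token = logp.gather(2, targets.unsqueeze(-1)).squeeze(-1)      # (N, T-1)
    # Zero out positions beyond each sequence's true length before summing.
    valid = (torch.arange(targets.size(1), device=tokens.device)[None, :]
             < (lengths - 1)[:, None])
    return (per_token * valid).sum(dim=1)
```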

Just so we can see what you are doing script-wise, if you could make a PR to the repo it would be great. We don't have to worry too much about making the scripts too nice; snowfall is all supposed to be a draft.

danpovey avatar Mar 23 '21 14:03 danpovey

There is another tokenizer that is used in torchtext:

  • https://github.com/explosion/spaCy
  • https://spacy.io/

csukuangfj avatar Apr 06 '21 11:04 csukuangfj