Why are the results of LM fusion worse than without LM fusion?
Hi! I implemented shallow LM fusion for ASR using your ComputeLogitsWithLM in asr/fusion.py. I then tested it many times with different parameters that might affect the fusion results, but most runs were worse than without LM fusion.
What I do in ComputeLogitsWithLM is (a minimal sketch follows below):
- compute the next-step log logits from the current log logits using the LM
- next log logits = current log logits + lambda * log logits computed by the LM
- return the next log logits
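Concretely, the combination step looks roughly like this (a generic sketch, not lingvo's actual ComputeLogitsWithLM signature; both inputs are assumed to already be log-softmax outputs, i.e. normalized log probabilities):

def fuse_step_scores(asr_log_probs, lm_log_probs, lam=0.1):
  """Shallow-fusion score for one decoding step.

  Args:
    asr_log_probs: [batch, vocab] log-softmax output of the ASR decoder step.
    lm_log_probs: [batch, vocab] log-softmax output of the LM for the same step.
    lam: LM interpolation weight (the lambda above).

  Returns:
    [batch, vocab] fused scores used to rank candidate next tokens in beam search.
  """
  return asr_log_probs + lam * lm_log_probs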
The following are the results:
They compare different settings on the test-other dataset. "beam" is num_hyps_per_beam in beam search and "beam_size" is the beam_size param in beam search. Here I also want to ask whether num_hyps_per_beam is the beam width (I have been treating num_hyps_per_beam as the beam width).
When beam == 1, fusion helps a little (16.19% -> 16.00%, lambda = 0.1). However, when beam == 8 it becomes worse than without LM fusion (15.45% -> 15.68%, lambda = 0.01) no matter how I change lambda (0.2 -> 0.1 -> 0.5 -> 0.01). So why can't I get improvements from LM fusion like those reported in your paper? Could you give me some advice? Thank you~
It is difficult to tell why there is no improvement without looking at the implementation. I would, however, make sure the trained LM is good enough. You could do n-best rescoring with the same LM and see whether there is any improvement. There could also be an issue with the model training. What data did you use to train the LM? Librispeech is read speech, so if the LM is trained on a different domain, the mismatch may mean you see no improvement. Analyzing the n-best lists with and without fusion would also help.
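By n-best rescoring I mean something like the following sketch (not lingvo code; it assumes you have already dumped the n-best hypotheses with their ASR scores, and score_with_lm is a placeholder for whatever returns the LM log probability of a transcript):

def rescore_nbest(nbest, score_with_lm, lm_weight=0.1):
  """Picks the best hypothesis after adding an LM score to each n-best entry.

  Args:
    nbest: list of (transcript, asr_log_prob) pairs from first-pass beam search.
    score_with_lm: callable mapping a transcript to its LM log probability
      (hypothetical helper; plug in your own LM here).
    lm_weight: interpolation weight for the LM score.

  Returns:
    The transcript with the highest combined score.
  """
  rescored = [(asr_lp + lm_weight * score_with_lm(hyp), hyp)
              for hyp, asr_lp in nbest]
  return max(rescored)[1]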
I have tested beam search with 8 hypotheses per beam and there was no improvement. I used the transcripts of the Librispeech training set to train the LM. Could you tell me what the n-best rescoring you mentioned is? (Is it n-beam beam search?) How can I directly use your code to do n-best rescoring? Besides, I am confused about the second-pass rescoring in your paper.
Hi, I'm confused about non-deterministic beam search results. Below are some WER results with beam_size == 3, using the pure logits without fusion (logits = tf.add(0.0 * state.step_out, 1.0 * logits)). Has anyone seen this before? Any help would be greatly appreciated.
no fusion, beam_size == 3: wer = 0.045496043, wer = 0.046237826, wer = 0.045629185, wer = 0.045857426
This variation may also affect how we judge the effect of fusion.
Watching this.
I observed the same phenomenon. Did you compute the PPL of your language model? I found that the performance of my RNNLM trained with lingvo is not good enough.
I think it should be log_softmax rather than log logits.
@by2101 Is log_pplx the PPL of the RNNLM? The following two pictures are "log_pplx" and "log_pplx_per_word".
But my loss curve is strange.
It should be log_pplx_per_word. The training PPL of your model (exp(5) ≈ 148) is relatively big. I found that the LM trained with lingvo does not perform well, and I don't know why. The loss is the sum of the losses.
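In other words, assuming log_pplx_per_word is in nats, PPL is just its exponential:

import math

log_pplx_per_word = 5.0              # value read off the training curve
ppl = math.exp(log_pplx_per_word)    # ~148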
It is indeed big, but from the curve it looks as if the language model has converged. Did your language model converge? What was your final PPL?
I have not trained an LM on Librispeech. However, in my experiments the PPL is worse than that of a 3-gram. I am still tuning it.
I used "xent_output.log_probs" to compute. It is already log_softmax.
Can your 3-gram, or other LM fusion, improve the ASR results?
It is difficult to integrate a 3-gram into the lingvo ASR decoder, so I have not done it.
Hahaha, yes, I think so. I would like to ask: how much better is your 3-gram's PPL?
It is our own data. So the absolute PPL value is not comparable.
I see~ I have been thinking about why LM fusion in lingvo does not work well. Let's communicate more in the future~
Hi, I would like to know how you implement the fusion code. I first use
activations = self.lm.rnns.rnn[num_layers-1].cell.GetOutput(lm_states.rnn[num_layers-1])
to get activations. Then, I use
lm_logits = self.lm.softmax.Logits(theta=self.lm.softmax.theta,inputs=activations)
to get the logits. I think this code is convoluted, so I would like to know whether you have a simpler way to implement it.
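At the moment the best I can do is wrap those two calls in a small helper so the fusion code stays short; this only repackages the lines above, it is not a different lingvo API:

def lm_logits_from_state(lm, lm_states, num_layers):
  """Reads the top RNN layer's output and projects it through the LM softmax.

  This is just a repackaging of the two calls shown above; `lm`, `lm_states`,
  and `num_layers` are whatever your fusion layer already has in scope.
  """
  activations = lm.rnns.rnn[num_layers - 1].cell.GetOutput(
      lm_states.rnn[num_layers - 1])
  return lm.softmax.Logits(theta=lm.softmax.theta, inputs=activations)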
Thanks for the helpful discussion on LM fusion.
When incorporating an LM into an end-to-end model via shallow fusion, transcript truncation is a common failure mode. Shorter sentences often have a higher total log prob according to the LM; for example, "hello </s>" is going to have a higher log prob than "hello how are you </s>". As a result, when we use an interpolated score like log p_ASR(y) + lambda * log p_LM(y), incomplete transcripts can get quite a boost and sometimes end up having the best overall score.
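Here is a toy illustration, with made-up numbers, of how a truncated hypothesis can win under that interpolated score:

# Toy numbers only (not from any real run). Each extra word contributes roughly
# -2 in LM log prob, so the truncated hypothesis accumulates far less LM penalty.
lambda_ = 0.3
truncated = {"asr": -8.0, "lm": -4.0}    # e.g. "hello </s>"
full = {"asr": -7.0, "lm": -10.0}        # e.g. "hello how are you </s>"

fused = lambda h: h["asr"] + lambda_ * h["lm"]
# The ASR score alone prefers the full hypothesis (-7.0 > -8.0), but the fused
# score flips the ranking: -9.2 for the truncated hypothesis vs -10.0 for the full one.
print(fused(truncated), fused(full))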
In my experiments, incorporating the LM will not help at all (and in fact often hurts WER) unless you also do something to counteract this failure mode. Have you inspected your output to see if this might be the issue? Are you using any sort of penalty or normalization to address transcript truncation?
If not, here is some additional info that may be helpful --
Several different types of penalties have been explored to counteract transcript truncation, including coverage penalty, EOS emission constraint, and length normalization. This paper presents a nice description and comparison of these: https://arxiv.org/abs/1612.02695
For the Librispeech shallow fusion model reported in our paper we used both a coverage penalty and an EOS emission constraint. See footnote 7 in the paper.
Typically I tune the coverage penalty in the range [0, 0.05] in increments of 0.01; it can be set using the coverage_penalty param in BeamSearchHelper. I also tune the EOS emission constraint in the range [0, 5.0] in increments of 1.0; it can be set using the valid_eos_max_logit_delta param in BeamSearchHelper.
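As a sketch (the exact location of the BeamSearchHelper params depends on how your task config is wired up, so treat decoder_params.beam_search below as a placeholder):

def set_truncation_penalties(decoder_params):
  """Applies the two penalties discussed above to a decoder's beam search params.

  `decoder_params.beam_search` is assumed to hold the BeamSearchHelper params;
  adjust the attribute path to match your own task configuration.
  """
  bs = decoder_params.beam_search
  bs.coverage_penalty = 0.02           # tuned in [0, 0.05] in steps of 0.01
  bs.valid_eos_max_logit_delta = 2.0   # tuned in [0, 5.0] in steps of 1.0
  return decoder_params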
Thank you very much! Very helpful information!