Why are the results of LM fusion worse than without LM fusion?
Hi! I implemented shallow LM fusion for ASR using your ComputeLogitsWithLM in asr/fusion.py. I then tested it many times with different parameters that might affect the fusion results, but most runs were worse than without LM fusion.
What I do in ComputeLogitsWithLM is (a minimal sketch follows below):
- compute the next-step log logits from the current log logits using the LM
- next log logits = current log logits + lambda * log logits computed by the LM
- return the next log logits
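Concretely, the combination step looks roughly like this (a generic sketch, not lingvo's actual ComputeLogitsWithLM signature; both inputs are assumed to already be log-softmax outputs, i.e. normalized log probabilities):

def fuse_step_scores(asr_log_probs, lm_log_probs, lam=0.1):
  """Shallow-fusion score for one decoding step.

  Args:
    asr_log_probs: [batch, vocab] log-softmax output of the ASR decoder step.
    lm_log_probs: [batch, vocab] log-softmax output of the LM for the same step.
    lam: LM interpolation weight (the lambda above).

  Returns:
    [batch, vocab] fused scores used to rank candidate next tokens in beam search.
  """
  return asr_log_probs + lam * lm_log_probs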
The following are the results:
They compare different settings on the test-other dataset. "beam" is num_hyps_per_beam in beam search and "beam_size" is the beam_size param in beam search. Here I also want to ask whether num_hyps_per_beam is the beam width (I have been treating num_hyps_per_beam as the beam width).
When beam == 1, fusion helps a little (16.19% -> 16.00%, lambda = 0.1). However, when beam == 8 it becomes worse than without LM fusion (15.45% -> 15.68%, lambda = 0.01) no matter how I change lambda (0.2 -> 0.1 -> 0.5 -> 0.01). So why can't I get improvements from LM fusion like those reported in your paper? Could you give me some advice? Thank you~
It is difficult to tell why there is no improvement without looking at the implementation. I would, however, make sure the trained LM is good enough. You could do n-best rescoring with the same LM and see whether there is any improvement. There could also be an issue with the model training. What data did you use to train the LM? Librispeech is read speech, so if the LM is trained on a different domain, the mismatch may mean you see no improvement. Analyzing the n-best lists with and without fusion would also help.
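By n-best rescoring I mean something like the following sketch (not lingvo code; it assumes you have already dumped the n-best hypotheses with their ASR scores, and score_with_lm is a placeholder for whatever returns the LM log probability of a transcript):

def rescore_nbest(nbest, score_with_lm, lm_weight=0.1):
  """Picks the best hypothesis after adding an LM score to each n-best entry.

  Args:
    nbest: list of (transcript, asr_log_prob) pairs from first-pass beam search.
    score_with_lm: callable mapping a transcript to its LM log probability
      (hypothetical helper; plug in your own LM here).
    lm_weight: interpolation weight for the LM score.

  Returns:
    The transcript with the highest combined score.
  """
  rescored = [(asr_lp + lm_weight * score_with_lm(hyp), hyp)
              for hyp, asr_lp in nbest]
  return max(rescored)[1]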
I have tested beam search with 8 hypotheses per beam and there was no improvement. I used the transcripts of the Librispeech training set to train the LM. Could you tell me what the n-best rescoring you mentioned is? (Is it n-beam beam search?) How can I directly use your code to do n-best rescoring? Besides, I am confused about the second-pass rescoring in your paper.
Hi, I'm confused about non-deterministic beam search results. Below are some WER results with beam_size == 3, using the pure logits without fusion (logits = tf.add(0.0 * state.step_out, 1.0 * logits)). Has anyone seen this before? Any help would be greatly appreciated.
no fusion, beam_size == 3: wer = 0.045496043, wer = 0.046237826, wer = 0.045629185, wer = 0.045857426
This variation may also affect how we judge the effect of fusion.
Watching this.
I observed the same phenomenon. Did you compute the PPL of your language model? I found that the performance of my RNNLM trained with lingvo is not good enough.
I think it should be log_softmax rather than log logits.
@by2101 Is log_pplx the PPL of the RNNLM? The following two pictures are "log_pplx" and "log_pplx_per_word".
But my loss curve is strange.
It should be log_pplx_per_word. The training PPL of your model (exp(5) ≈ 148) is relatively big. I found that the LM trained with lingvo does not perform well, and I don't know why. The loss is the sum of the losses.
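In other words, assuming log_pplx_per_word is in nats, PPL is just its exponential:

import math

log_pplx_per_word = 5.0              # value read off the training curve
ppl = math.exp(log_pplx_per_word)    # ~148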
It is indeed big, but from the curve it looks as if the language model has converged. Did your language model converge? What was your final PPL?
I have not trained an LM on Librispeech. However, in my experiments the PPL is worse than that of a 3-gram. I am still tuning it.
I used "xent_output.log_probs" to compute. It is already log_softmax.
Can your 3-gram, or other LM fusion, improve the ASR results?
It is difficult to integrate a 3-gram into the lingvo ASR decoder, so I have not done it.
Hahaha, yes, I think so. I would like to ask: how much better is your 3-gram's PPL?
It is our own data. So the absolute PPL value is not comparable.
I see~ I have been thinking about why LM fusion in lingvo does not work well. Let's communicate more in the future~
Hi, I would like to know how you implement the fusion code. I first use
activations = self.lm.rnns.rnn[num_layers-1].cell.GetOutput(lm_states.rnn[num_layers-1])
to get activations. Then, I use
lm_logits = self.lm.softmax.Logits(theta=self.lm.softmax.theta,inputs=activations)
to get the logits. I think this code is convoluted, so I would like to know whether you have a simpler way to implement it.
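At the moment the best I can do is wrap those two calls in a small helper so the fusion code stays short; this only repackages the lines above, it is not a different lingvo API:

def lm_logits_from_state(lm, lm_states, num_layers):
  """Reads the top RNN layer's output and projects it through the LM softmax.

  This is just a repackaging of the two calls shown above; `lm`, `lm_states`,
  and `num_layers` are whatever your fusion layer already has in scope.
  """
  activations = lm.rnns.rnn[num_layers - 1].cell.GetOutput(
      lm_states.rnn[num_layers - 1])
  return lm.softmax.Logits(theta=lm.softmax.theta, inputs=activations)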
Thanks for the helpful discussion on LM fusion.
When incorporating an LM into an end-to-end model via shallow fusion, transcript truncation is a common failure mode. Shorter sentences often have a higher total log prob according to the LM; for example, "hello </s>" is going to have a higher log prob than "hello how are you </s>". As a result, when we use an interpolated score like log p_ASR(y) + lambda * log p_LM(y), incomplete transcripts can get quite a boost and sometimes end up having the best overall score.
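Here is a toy illustration, with made-up numbers, of how a truncated hypothesis can win under that interpolated score:

# Toy numbers only (not from any real run). Each extra word contributes roughly
# -2 in LM log prob, so the truncated hypothesis accumulates far less LM penalty.
lambda_ = 0.3
truncated = {"asr": -8.0, "lm": -4.0}    # e.g. "hello </s>"
full = {"asr": -7.0, "lm": -10.0}        # e.g. "hello how are you </s>"

fused = lambda h: h["asr"] + lambda_ * h["lm"]
# The ASR score alone prefers the full hypothesis (-7.0 > -8.0), but the fused
# score flips the ranking: -9.2 for the truncated hypothesis vs -10.0 for the full one.
print(fused(truncated), fused(full))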
In my experiments, incorporating the LM will not help at all (and in fact often hurts WER) unless you also do something to counteract this failure mode. Have you inspected your output to see if this might be the issue? Are you using any sort of penalty or normalization to address transcript truncation?
If not, here is some additional info that may be helpful --
Several different types of penalties have been explored to counteract transcript truncation, including coverage penalty, EOS emission constraint, and length normalization. This paper presents a nice description and comparison of these: https://arxiv.org/abs/1612.02695
For the Librispeech shallow fusion model reported in our paper we used both a coverage penalty and an EOS emission constraint. See footnote 7 in the paper.
Typically I tune the coverage penalty in the range [0, 0.05] in increments of 0.01; it can be set using the coverage_penalty param in BeamSearchHelper. I also tune the EOS emission constraint in the range [0, 5.0] in increments of 1.0; it can be set using the valid_eos_max_logit_delta param in BeamSearchHelper.
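As a sketch (the exact location of the BeamSearchHelper params depends on how your task config is wired up, so treat decoder_params.beam_search below as a placeholder):

def set_truncation_penalties(decoder_params):
  """Applies the two penalties discussed above to a decoder's beam search params.

  `decoder_params.beam_search` is assumed to hold the BeamSearchHelper params;
  adjust the attribute path to match your own task configuration.
  """
  bs = decoder_params.beam_search
  bs.coverage_penalty = 0.02           # tuned in [0, 0.05] in steps of 0.01
  bs.valid_eos_max_logit_delta = 2.0   # tuned in [0, 5.0] in steps of 1.0
  return decoder_params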
Thank you very much! Very helpful information!