
Negative weights obtained from interpolation

Open tomassykora opened this issue 5 years ago • 2 comments

Hi, I've computed interpolation weights over 5 text corpora. Two of them got negative weights. I'd understand zero weights, but what do negative values really mean? I'm not sure whether it's a bug or expected behaviour.

I wanted to use those weights while training an RNN language model in Kaldi, which fails when computing unigram probabilities because it relies on the weights being positive.

It'd be a pity to set those weights to zero and lose those texts, since they are two big corpora that could help with the production predictions.

tomassykora avatar Mar 23 '20 15:03 tomassykora

Log-linear interpolation is far from perfect. But consider the domain adaptation effect.
Your corpora could be:

  1. YouTube comments.
  2. All text on YouTube.
  3. The web.

If the mixture is optimized for YouTube comments, arguably the model should attempt to subtract out the web-UI part of "all text on YouTube", and a negative weight is how a log-linear mixture expresses that subtraction.
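A minimal numeric sketch of that effect (the model names, probabilities, and weights below are invented for illustration; this is the general log-linear formula, not kenlm's actual tuning code):

```python
# Hypothetical per-word probabilities from three component models
# for one word in one context (numbers invented for illustration).
p_comments = 0.020  # 1. YouTube comments
p_all_yt   = 0.050  # 2. all text on YouTube (comments plus web-UI boilerplate)
p_web      = 0.010  # 3. the web

# Log-linear interpolation scores a word as a weighted product:
#   p(w | h) ∝ prod_i p_i(w | h) ** lambda_i
# so a negative lambda_i divides by that model's probability,
# penalizing words that model likes (e.g. web-UI text).
lambdas = {"comments": 1.2, "all_yt": -0.3, "web": 0.1}

score = (p_comments ** lambdas["comments"]
         * p_all_yt ** lambdas["all_yt"]
         * p_web ** lambdas["web"])

# A real LM would normalize by Z(h), the sum of these scores over
# the whole vocabulary for the context h; the sign logic is the same.
print(score)
```

Words that the "all text on YouTube" model likes but the comments model doesn't end up penalized, which is the subtraction described above.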

kpu avatar Mar 27 '20 18:03 kpu

Also, have you tried just concatenating everything and adding that to the mix?
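A rough sketch of that suggestion (the corpus file names are hypothetical; `lmplz` is kenlm's estimator, which reads tokenized text on stdin and writes an ARPA model to stdout):

```python
import subprocess

# Hypothetical corpus files; substitute your five corpora.
corpora = ["comments.txt", "all_youtube.txt", "web.txt"]

# Concatenate everything into a single training file.
with open("combined.txt", "wb") as out:
    for path in corpora:
        with open(path, "rb") as f:
            out.write(f.read())

# Estimate a 5-gram model on the concatenation; combined.arpa can
# then be added as one more component of the interpolation.
with open("combined.txt", "rb") as text, open("combined.arpa", "wb") as arpa:
    subprocess.run(["lmplz", "-o", "5"], stdin=text, stdout=arpa, check=True)
```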

kpu avatar Mar 27 '20 18:03 kpu