meta
language model estimation
language_model
The language model needs the ability to estimate its probabilities from a corpus instead of requiring a .arpa file.
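To make the request concrete, here is a rough sketch of what "estimate from a corpus" means at its simplest: count n-grams in tokenized sentences and take unsmoothed maximum-likelihood ratios. This is not the project's API; `count_ngrams`, `mle_prob`, and the `<s>`/`</s>` markers are names assumed for illustration, and a real implementation would add smoothing on top of these counts.

```python
from collections import Counter


def count_ngrams(sentences, order):
    """Count every n-gram of length 1..order in pre-tokenized sentences.

    Hypothetical helper: each sentence is wrapped in <s> ... </s> markers.
    """
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += 1
    return counts


def mle_prob(counts, context, word):
    """Unsmoothed estimate P(word | context) = c(context, word) / c(context)."""
    context = tuple(context)
    denom = counts[context]
    if denom == 0:
        return 0.0  # unseen context; a smoothed model would back off here
    return counts[context + (word,)] / denom


corpus = [["a", "b"], ["a", "c"]]
counts = count_ngrams(corpus, order=2)
p = mle_prob(counts, ["a"], "b")
```

The counts are the only thing this step needs from the corpus; the discounting scheme discussed below would replace `mle_prob`, not `count_ngrams`.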
We should consider this formulation (scroll down for the actual paper): http://homepages.inf.ed.ac.uk/s0562315/progs/#pldlm
Kenneth even mentions it in his thesis as future work. It looks like it isn't much harder to implement than modified interpolated Kneser-Ney, while giving better perplexity.
Once we implement this, we should have some way of saving the tokenization setup to ensure queries using the LM are tokenized the same way.
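One possible way to save the tokenization setup, as a minimal sketch: write the settings to a JSON file next to the model and reload them before tokenizing queries. Everything here is assumed for illustration (the `.tokenizer.json` suffix, the `lowercase` option, and the function names); none of it is an existing interface in this project.

```python
import json
from pathlib import Path


def _sidecar_path(model_path):
    # Hypothetical convention: settings live next to the model file.
    return Path(str(model_path) + ".tokenizer.json")


def save_tokenizer_config(model_path, config):
    """Write the tokenizer settings used at estimation time."""
    path = _sidecar_path(model_path)
    path.write_text(json.dumps(config, sort_keys=True, indent=2), encoding="utf-8")
    return path


def load_tokenizer_config(model_path):
    """Read back the settings so queries are tokenized the same way."""
    return json.loads(_sidecar_path(model_path).read_text(encoding="utf-8"))


def tokenize(text, config):
    """Toy whitespace tokenizer driven entirely by the saved settings."""
    if config.get("lowercase", False):
        text = text.lower()
    return text.split()
```

At query time the caller would load the config with `load_tokenizer_config(model_path)` and pass it to the same `tokenize` function used during estimation, so the two cannot drift apart.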