
How to incorporate unknowns in the LM build process?

Open ankitmundada opened this issue 7 years ago • 1 comment

As mentioned here, <unk> tokens are replaced with a space when building the ARPA file. Does that mean the correctness of the sentences where such a replacement occurs is compromised? I use <unk> tokens to limit the vocabulary size while still accounting for words with very low frequency in the dataset (rare words). Can the filter function solve this, as mentioned here, by passing it a limited vocabulary?

ankitmundada, Apr 10 '18 13:04

The filter is designed to use less memory by removing words that won't be queried by a given system. It's not what you're looking for.

Replacing low-frequency words with <unk> breaks assumptions made by Kneser-Ney smoothing. I never did get around to implementing a closed-vocabulary LM with explicit <unk> contained in the text, but would welcome a patch.

kpu, Apr 29 '18 12:04
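
The reply above does not say which Kneser-Ney assumptions are broken. One plausible reading (an interpretation, not something stated in the thread) is that the discounts are estimated from counts-of-counts, in particular the number of words seen exactly once, and collapsing rare words into <unk> removes those singletons. The sketch below illustrates this on a toy corpus using raw unigram counts and the textbook single-discount estimate D = n1 / (n1 + 2 * n2). It is not KenLM code: `replace_rare`, `counts_of_counts`, and `absolute_discount` are hypothetical helper names, and KenLM's actual estimation uses modified Kneser-Ney with several discounts and adjusted counts.

```python
from collections import Counter

UNK = "<unk>"


def replace_rare(sentences, min_count):
    """Replace every word seen fewer than min_count times with UNK."""
    freq = Counter(w for s in sentences for w in s.split())
    return [
        " ".join(w if freq[w] >= min_count else UNK for w in s.split())
        for s in sentences
    ]


def counts_of_counts(sentences, max_k=2):
    """Return {k: number of distinct words occurring exactly k times}."""
    freq = Counter(w for s in sentences for w in s.split())
    coc = Counter(freq.values())
    return {k: coc.get(k, 0) for k in range(1, max_k + 1)}


def absolute_discount(n1, n2):
    """Textbook estimate D = n1 / (n1 + 2 * n2); assumes n1 + n2 > 0."""
    return n1 / (n1 + 2 * n2)


# Toy corpus (made up for illustration only).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat saw a dog",
]

before = counts_of_counts(corpus)          # {1: 3, 2: 5}
replaced = replace_rare(corpus, min_count=2)
after = counts_of_counts(replaced)         # {1: 0, 2: 5}

print("before:", before, "D =", absolute_discount(before[1], before[2]))
print("after: ", after, "D =", absolute_discount(after[1], after[2]))
```

On this toy data the three singletons (mat, rug, saw) all become <unk>, so n1 drops from 3 to 0 and the estimated discount collapses to 0. That is the kind of distortion a smoothing method that relies on singleton statistics is not designed for.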