kenlm How to incorporate unknowns in Lm Build process?

Open ankitmundada opened this issue 7 years ago • 1 comments

As mentioned here, <unk> tokens are replaced with a space when building the ARPA file. Does that mean the sentences where this replacement occurs are no longer modeled correctly? I use <unk> tokens to limit the vocabulary size while still accounting for very low-frequency (rare) words in the dataset. Could the filter function, as mentioned here, solve this if I pass it a limited vocabulary?

Apr 10 '18 13:04 ankitmundada

The filter is designed to use less memory by removing words that won't be queried by a given system. It's not what you're looking for.

Replacing low-frequency words with <unk> breaks assumptions made by Kneser-Ney smoothing. I never did get around to implementing a closed-vocabulary LM with explicit <unk> contained in the text, but would welcome a patch.
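
A rough way to see the problem is to look at the count-of-counts statistics that Kneser-Ney style discounts are estimated from. The sketch below is a simplified illustration, not KenLM's code: it uses raw unigram counts and the classic single-discount estimate D = n1 / (n1 + 2*n2), whereas KenLM estimates its discounts from adjusted n-gram counts. The helper names (`replace_rare`, `count_of_counts`, `absolute_discount`) and the toy corpus are made up for this example.

```python
from collections import Counter

UNK = "<unk>"


def replace_rare(tokens, min_count):
    """Map every word seen fewer than min_count times to UNK."""
    freq = Counter(tokens)
    return [w if freq[w] >= min_count else UNK for w in tokens]


def count_of_counts(tokens, max_count=4):
    """Return {k: number of distinct words that occur exactly k times}."""
    freq = Counter(tokens)
    histogram = Counter(freq.values())
    return {k: histogram.get(k, 0) for k in range(1, max_count + 1)}


def absolute_discount(n):
    """Classic estimate D = n1 / (n1 + 2*n2); None when it is undefined."""
    denom = n[1] + 2 * n[2]
    return n[1] / denom if denom else None


# Toy corpus (hypothetical): six words occur exactly once.
corpus = "the cat sat on the mat while the dog sat near a cat".split()
clipped = replace_rare(corpus, min_count=2)

raw_stats = count_of_counts(corpus)
clipped_stats = count_of_counts(clipped)

print(raw_stats, absolute_discount(raw_stats))          # n1 = 6, D = 0.6
print(clipped_stats, absolute_discount(clipped_stats))  # n1 = 0, D = 0.0
```

In this toy case the replacement removes every singleton, so n1 drops to zero and the estimated discount collapses to 0. The smoothing then no longer reserves the probability mass it expects to hand to unseen events, which is one way the assumptions can break.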

Apr 29 '18 12:04 kpu
