pruning of words and n-grams
We could prune all rare sequences and words; we don't have this yet.
Pruning to the most frequent words and then using UNK for "unknown" tokens should still be implemented; especially in modified Kneser-Ney this would be worth focusing on.
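Just to sketch what I mean by vocabulary pruning with UNK (a rough Python sketch; the function name, the <UNK> symbol and the cutoff are made up for illustration, not taken from our code):

```python
from collections import Counter

def prune_vocabulary(sentences, max_vocab_size, unk_token="<UNK>"):
    """Keep only the most frequent words; map everything else to unk_token."""
    counts = Counter(word for sentence in sentences for word in sentence)
    vocab = {word for word, _ in counts.most_common(max_vocab_size)}
    return [[word if word in vocab else unk_token for word in sentence]
            for sentence in sentences]

# Tiny example: with max_vocab_size=2 only the two most frequent words survive.
corpus = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "ran"]]
print(prune_vocabulary(corpus, max_vocab_size=2))
# [['the', 'cat', '<UNK>'], ['the', '<UNK>', '<UNK>'], ['the', 'cat', '<UNK>']]
```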
Afterwards, pruning rare n-grams would also be an idea. I expect that generalized language models would still perform well this way while using less space.
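Pruning rare n-grams could then just be a count threshold; a rough sketch (names and threshold are illustrative only, and for modified Kneser-Ney we would have to check how this interacts with the discount estimation, since the discounts are derived from exactly the low counts we would be throwing away):

```python
def prune_ngram_counts(ngram_counts, min_count=2):
    """Drop every n-gram whose count falls below min_count."""
    return {ngram: count for ngram, count in ngram_counts.items()
            if count >= min_count}

# Tiny example: only n-grams seen at least min_count times are kept.
counts = {("the", "cat"): 5, ("cat", "sat"): 1, ("the", "dog"): 1}
print(prune_ngram_counts(counts, min_count=2))  # {('the', 'cat'): 5}
```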
This should also be discussed with Till.
I guess we won't have any pruning of counts for the stable release?
Agreed. It might be relevant for your bachelor thesis though, because your beam search (if you use it) will do something similar, and saving disk space is an important goal, especially in the GLM case.