pruning of words and n-grams
We could prune all rare sequences and words; we don't have this yet.
Pruning to the most frequent words and then using UNK for "unknown" tokens should still be implemented; especially in modified Kneser-Ney this would be worth focusing on.
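Just to sketch what I mean by vocabulary pruning with UNK (a rough Python sketch; the function name, the <UNK> symbol and the cutoff are made up for illustration, not taken from our code):

```python
from collections import Counter

def prune_vocabulary(sentences, max_vocab_size, unk_token="<UNK>"):
    """Keep only the most frequent words; map everything else to unk_token."""
    counts = Counter(word for sentence in sentences for word in sentence)
    vocab = {word for word, _ in counts.most_common(max_vocab_size)}
    return [[word if word in vocab else unk_token for word in sentence]
            for sentence in sentences]

# Tiny example: with max_vocab_size=2 only the two most frequent words survive.
corpus = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "ran"]]
print(prune_vocabulary(corpus, max_vocab_size=2))
# [['the', 'cat', '<UNK>'], ['the', '<UNK>', '<UNK>'], ['the', 'cat', '<UNK>']]
```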
Afterwards, pruning rare n-grams would also be an idea. I expect that generalized language models would still perform well this way while using less space.
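Pruning rare n-grams could then just be a count threshold; a rough sketch (names and threshold are illustrative only, and for modified Kneser-Ney we would have to check how this interacts with the discount estimation, since the discounts are derived from exactly the low counts we would be throwing away):

```python
def prune_ngram_counts(ngram_counts, min_count=2):
    """Drop every n-gram whose count falls below min_count."""
    return {ngram: count for ngram, count in ngram_counts.items()
            if count >= min_count}

# Tiny example: only n-grams seen at least min_count times are kept.
counts = {("the", "cat"): 5, ("cat", "sat"): 1, ("the", "dog"): 1}
print(prune_ngram_counts(counts, min_count=2))  # {('the', 'cat'): 5}
```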
This should also be discussed with Till.
I guess we won't have any pruning of counts for the stable release?
Agreed. It might be relevant for your bachelor thesis though, because your beam search (if you use it) will do something similar, and saving disk space is an important goal, especially in the GLM case.