neuralconvo
Maximum Vocabulary Size
Hi @macournoyer
We currently replace words with the "unknown" token once the number of unique words we have encountered reaches the vocab size:
if self.maxVocabSize > 0 and self.wordsCount >= self.maxVocabSize then
-- We've reached the maximum size for the vocab. Replace w/ unknown token
return self.unknownToken
end
I think we might get better results if we instead replaced the least frequent words with the unknown token. The current first-come cutoff might be one reason for the inferior results when we restrict the vocabulary.
Yes, that might be the reason. But restricting based on frequency is a lot harder to implement, since you have to rewrite all the examples: word IDs change when you remove a word from the vocabulary.
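That rewriting can be done mechanically, though: build an old-id → new-id map once, then sweep every stored example through it, collapsing dropped words to the unknown id. A minimal Python sketch of the idea (the function and argument names are hypothetical, not from the repo):

```python
def remap_examples(examples, old_to_new, unk_new_id):
    """Rewrite encoded examples after the vocabulary shrank.

    examples   -- list of examples, each a list of old word ids
    old_to_new -- dict mapping each kept old id to its new contiguous id
    unk_new_id -- new id of the unknown token; dropped words collapse to it
    """
    return [[old_to_new.get(i, unk_new_id) for i in ex] for ex in examples]
```

One linear pass over the dataset, so the cost is proportional to the total number of tokens stored.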
@macournoyer
I think we could take a pass over the dataset (only the lines used for training) to count word frequencies, then drop the least frequent words until we hit the vocabulary size.
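The counting pass described above can be sketched in a few lines. This is a Python illustration of the idea, not the project's Lua code; the names (`build_vocab`, `encode`, `<unk>`) are assumptions:

```python
from collections import Counter

def build_vocab(lines, max_size, unk="<unk>"):
    # First pass: count word frequencies over the training lines only
    counts = Counter(w for line in lines for w in line.split())
    # Reserve one slot for the unknown token, keep the most frequent words
    word2id = {unk: 0}
    for w, _ in counts.most_common(max_size - 1):
        word2id[w] = len(word2id)
    return word2id

def encode(line, word2id, unk="<unk>"):
    # Words outside the kept vocabulary collapse to the unknown id
    return [word2id.get(w, word2id[unk]) for w in line.split()]
```

Everything rarer than the top `max_size - 1` words maps to `<unk>`, so the cutoff is by frequency rather than by order of first appearance.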
I think @chenb67 has already done this in the PR.
We would not have to rewrite the examples if the dataset's unique word count fits within the vocabulary size; otherwise we would. If it improves accuracy, it is worthwhile, I guess :)
Hey guys, I have a fork that does this: TorchNeuralConvo
That's basically how I did it (order the vocab by count, then replace).
There are some tricks, though, if you want to load huge files while staying within the LuaJIT memory limits.