windbag
handle incremental dictionary
Currently we generate a new dictionary, mapping words to ids, for each dataset. If new words come in later, we have to retrain the whole model.
I am considering creating a dictionary with some placeholder slots that stand for future, not-yet-known words. Since the Cornell-Movie dataset has about 24k words, creating a 100,000-word dictionary seems like a reasonable starting size for now.
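The idea above could be sketched roughly as follows. This is a hypothetical illustration, not code from windbag: a fixed-capacity word-to-id dictionary whose unused ids act as the placeholder slots, so the embedding table never needs to be resized when new words show up.

```python
# Hypothetical sketch: a fixed-capacity dictionary with placeholder ids
# reserved for future-known words. Class and method names are illustrative.

class IncrementalDictionary:
    def __init__(self, capacity=100_000):
        self.capacity = capacity       # pre-allocated embedding table size
        self.word_to_id = {}

    def add(self, word):
        """Assign the next free placeholder id to an unseen word."""
        if word in self.word_to_id:
            return self.word_to_id[word]
        if len(self.word_to_id) >= self.capacity:
            raise ValueError("dictionary capacity exhausted")
        new_id = len(self.word_to_id)
        self.word_to_id[word] = new_id
        return new_id

    def lookup(self, word):
        # Words added later still get ids inside the pre-allocated range,
        # so the model does not need retraining just to accept them.
        return self.word_to_id.get(word)

d = IncrementalDictionary(capacity=100_000)
d.add("hello")
d.add("world")
```

New words consume the reserved ids one by one, so the table size (100k here) bounds how many future words can be absorbed before a real retrain is needed.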
There is still an issue with these future-known words: they are never seen in the training dataset, yet the predicted answers may contain them. I think we can mark these words as
Found something helpful: tf.contrib.layers.scattered_embedding_lookup.
Reference: http://arxiv.org/pdf/1504.04788.pdf
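The referenced paper describes the hashing trick, which is the idea behind that lookup: instead of one embedding row per dictionary entry, every word is hashed into a fixed number of buckets, so even a completely unseen word maps to a valid row. A minimal sketch of the mapping step, with an illustrative bucket count and hash choice (not windbag's actual settings):

```python
# Sketch of the hashing trick: hash any word, seen or unseen, into a
# fixed-size embedding table. Bucket count below is an assumption.

import hashlib

NUM_BUCKETS = 1024  # fixed table size, independent of vocabulary growth

def word_to_bucket(word, num_buckets=NUM_BUCKETS):
    """Deterministically map a word to an embedding row index."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets
```

The trade-off is collisions: distinct words can share a row, which the paper argues is acceptable in exchange for a table whose size no longer depends on the vocabulary.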