
Question about FastText

Open · Dongximing opened this issue · 1 comment

Hi Ben, in FastText we can use bi-grams to get tokens, for example "I love u" --> with bi-grams --> I, love, u, I love, love u. So if we set the max vocab size to 4 (I, love, u, I love are in the vocab) and the initial word embeddings are vectors = "glove.6B.100d", how is "I love" represented in the word embedding? I think we cannot get an initial word embedding for "I love" from GloVe, so does unk_init = torch.Tensor.normal_ solve this problem (i.e. use a normal distribution to initialize "I love")? Also, will "love u" be treated as UNK and have its embedding set to 0 (since we assumed only 4 tokens in the vocabulary)?
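For reference, the bi-grams in that example come from appending every adjacent token pair to the token list, along the lines of the tutorial's generate_bigrams function (a minimal sketch; the example call at the end is just an illustration):

```python
def generate_bigrams(x):
    # x is a list of tokens, e.g. ['I', 'love', 'u']
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        x.append(' '.join(n_gram))
    return x

print(generate_bigrams(['I', 'love', 'u']))
# -> ['I', 'love', 'u', 'I love', 'love u'] (bi-gram order may vary, since a set is used)
```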

Dongximing · Jan 17 '22 08:01

All words not in the GloVe vocabulary will have their embeddings initialized to a normally distributed vector, which is what the unk_init does. As the GloVe vocabulary only contains single words, all bi-grams will be initialized this way.
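As a minimal sketch of what that initialization amounts to (the glove dict and the 4-token vocab here are mock stand-ins for the real GloVe vectors and the built vocabulary):

```python
import torch

# Mock pretrained vectors: GloVe only has entries for single words.
glove = {"i": torch.randn(100), "love": torch.randn(100), "u": torch.randn(100)}

vocab = ["i", "love", "u", "i love"]   # the 4-token vocab from the example above
unk_init = torch.Tensor.normal_        # same initializer as in the question

weights = torch.zeros(len(vocab), 100)
for idx, token in enumerate(vocab):
    if token in glove:
        weights[idx] = glove[token]    # copy the pretrained vector
    else:
        unk_init(weights[idx])         # e.g. "i love": filled in place from N(0, 1)
```

In the tutorial this happens when the vocabulary is built with vectors="glove.6B.100d" and unk_init=torch.Tensor.normal_, and the resulting vector matrix is then copied into the model's embedding layer.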

"love u" will be set to UNK only if it is not in the 25,000 most common tokens/bi-grams. If you have a lot of examples then "love u" will be in the vocabulary and thus not UNK'd. This will depend on your dataset, but the main reason for using bi-grams is that there in fact will be plenty of bi-grams that appear enough times to appear in the 25,000 most common tokens/bi-grams and won't be UNK'd. See this comment: https://github.com/bentrevett/pytorch-sentiment-analysis/issues/69, which shows that in the IMDB dataset 65% of the vocabulary is actually bi-grams.

bentrevett · Feb 10 '22 11:02