
some unknown words

Open sunxx772 opened this issue 6 years ago • 3 comments

Hello, when I run the following code:

    data = ['this is a positive sentence', 'this is a negative sentence', 'yet another positve sentence', 'the last one is negative']
    wordVec_model = dl.loadGloveModel('glove.6B.50d.txt')
    data_inp, embedding_matrix = dl.process_data(sent_l=data, wordVec_model=wordVec_model, dimx=10, embedding_dim=50)

the result shows the following errors:

    Loading Glove File.....
    Loaded Word2Vec GloVe Model.....
    400000 words loaded.....
    found 14 unique words
    number of unkown words: 4
    some unknown words ['$END$', '$START$', 'positve', '$UNK$']

Please help me. Thank you very much!

sunxx772, Apr 06 '18 22:04

This is not an error. The dl.process_data function simply prints some of the words that are unknown/undefined in the pre-trained model. We are using GloVe pre-trained embeddings; the glove.6B files were trained on a corpus of 6 billion tokens and have a vocabulary of 400,000 words (the "400000 words loaded" line in your output). Although that covers a wide range of words, many words are still missing from the vocabulary. In the example above, the word positve is misspelled, so it cannot be in the GloVe embeddings. Moreover, in dl.process_data we add the $START$ and $END$ tokens at the beginning and end of each input sentence (you can think of them as padding). Similarly, $UNK$ is used for undefined words. These special tokens are not part of the GloVe vocabulary either, which is why they also appear in the list of unknown words.
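To illustrate the idea, here is a minimal sketch of how such an unknown-word report can come about. It is not the actual dl.process_data code: the function name find_unknown_words, the whitespace tokenization, and the small toy_vocab set (standing in for the 400,000-word GloVe vocabulary) are assumptions made only for this example.

```python
# Hypothetical sketch, NOT the real dl.process_data implementation.
START, END, UNK = "$START$", "$END$", "$UNK$"


def find_unknown_words(sentences, vocab):
    """Return the sorted list of tokens that have no entry in `vocab`."""
    seen = {UNK}  # placeholder token used for any undefined word
    for sent in sentences:
        # Assumed preprocessing: lowercase, split on whitespace,
        # then wrap the sentence in the start/end markers.
        tokens = [START] + sent.lower().split() + [END]
        seen.update(tokens)
    return sorted(w for w in seen if w not in vocab)


data = [
    "this is a positive sentence",
    "this is a negative sentence",
    "yet another positve sentence",  # misspelling kept on purpose
    "the last one is negative",
]

# Toy stand-in for the GloVe vocabulary: every correctly spelled word
# from `data`, but none of the special tokens and not the typo.
toy_vocab = {
    "this", "is", "a", "positive", "sentence", "negative",
    "yet", "another", "the", "last", "one",
}

unknown = find_unknown_words(data, toy_vocab)
print("number of unknown words:", len(unknown))
print("some unknown words", unknown)
```

With this toy vocabulary, the same four tokens as in the report above come out as unknown: the three special tokens and the misspelled word. Correcting positve to positive in the input would leave only the special tokens.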

GauravBh1010tt, Apr 07 '18 15:04

I see, thank you very much!

sunxx772, Apr 07 '18 21:04

@sunxx772 @GauravBh1010tt If the description is satisfactory then can we close this one? 🎏

adityac8, Apr 07 '18 23:04