DL-text
some unknown words
Hello, when I run the following code:
data = ['this is a positive sentence', 'this is a negative sentence', 'yet another positve sentence', 'the last one is negative']
wordVec_model = dl.loadGloveModel('glove.6B.50d.txt')
data_inp, embedding_matrix = dl.process_data(sent_l=data, wordVec_model=wordVec_model, dimx=10, embedding_dim=50)
The result shows the following errors:
Loading Glove File.....
Loaded Word2Vec GloVe Model.....
400000 words loaded.....
found 14 unique words
number of unkown words: 4
some unknown words ['$END$', '$START$', 'positve', '$UNK$']
Please help me, thank you very much!
This is not an error. The dl.process_data function simply prints some of the words that are unknown/undefined in the pre-trained model.
We are using GloVe pre-trained embeddings. The glove.6B vectors were trained on a corpus of about 6 billion tokens and have a vocabulary of 400,000 words (the "400000 words loaded" line in your output). Although that covers a wide range of words, many words are still missing from the vocabulary. In the above example, the word positve is misspelled, so it has no entry in the GloVe embeddings.
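Moreover, in dl.process_data, we add the $START$ and $END$ tokens at the beginning and end of each input sentence (you can think of them as padding). Similarly, $UNK$ is used for undefined words. None of these special tokens exist in the GloVe vocabulary, which is why they also show up in the list of unknown words.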
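To make the idea concrete, here is a minimal sketch of how such an unknown-word check could work. This is not the actual DL-text implementation: the helper names (wrap_and_tokenize, find_unknown_words, replace_unknown) and the tiny toy_vocab set are hypothetical, and toy_vocab merely stands in for the keys of the loaded GloVe model.

```python
# Hypothetical sketch, not the DL-text source code.
START, END, UNK = "$START$", "$END$", "$UNK$"


def wrap_and_tokenize(sentences):
    """Split each sentence on whitespace and wrap it in $START$ / $END$."""
    return [[START] + s.lower().split() + [END] for s in sentences]


def find_unknown_words(tokenized, vocab):
    """Return the set of unique tokens and the sorted list of those missing from vocab."""
    unique = {w for sent in tokenized for w in sent}
    unknown = sorted(w for w in unique if w not in vocab)
    return unique, unknown


def replace_unknown(tokenized, vocab):
    """Map every out-of-vocabulary word (except the wrapper tokens) to $UNK$."""
    keep = {START, END}
    return [[w if (w in vocab or w in keep) else UNK for w in sent]
            for sent in tokenized]


data = ['this is a positive sentence', 'this is a negative sentence',
        'yet another positve sentence', 'the last one is negative']

# Assumption: a toy vocabulary standing in for the 400,000 GloVe words.
toy_vocab = {'this', 'is', 'a', 'positive', 'negative', 'sentence',
             'yet', 'another', 'the', 'last', 'one'}

tokenized = wrap_and_tokenize(data)
unique, unknown = find_unknown_words(tokenized, toy_vocab)
print("found", len(unique), "unique words")
print("some unknown words", unknown)
print(replace_unknown(tokenized, toy_vocab)[2])
```

In this sketch the misspelled word positve and the two wrapper tokens are reported as unknown, and positve is then mapped to $UNK$. Fixing the typo in the input would remove it from the list, while the special tokens would still be reported.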
I see, thank you very much!
@sunxx772 @GauravBh1010tt If the explanation is satisfactory, can we close this one? 🎏