DL-text some unknown words

Hello, when I run the following code:

    data = ['this is a positive sentence', 'this is a negative sentence', 'yet another positve sentence', 'the last one is negative']
    wordVec_model = dl.loadGloveModel('glove.6B.50d.txt')
    data_inp, embedding_matrix = dl.process_data(sent_l = data, wordVec_model = wordVec_model, dimx = 10, embedding_dim = 50)

the result shows the following errors:
    Loading Glove File.....
    Loaded Word2Vec GloVe Model.....
    400000 words loaded.....
    found 14 unique words
    number of unkown words: 4
    some unknown words ['$END$', '$START$', 'positve', '$UNK$']

Please help me. Thank you very much!

Apr 06 '18 22:04 sunxx772

This is not an error. dl.process_data simply prints some of the words that are unknown (undefined) in the pre-trained model. We are using GloVe pre-trained embeddings, whose vocabulary here contains 400,000 words (see the "400000 words loaded" line in your output). Although that covers a wide range of words, many words are still missing from the vocabulary. In the example above, the word positve is misspelled, so it has no entry in the GloVe embeddings. Moreover, dl.process_data adds the $START$ and $END$ tokens at the beginning and end of each input sentence (you can think of them as padding). Similarly, $UNK$ is used for undefined words. None of these special tokens exist in the GloVe vocabulary, which is why they are listed as unknown too.
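To make the idea concrete, here is a minimal sketch of how such an unknown-word check could work. It is not the actual dl.process_data implementation: the helper find_unknown_words and the tiny toy_vocab set (standing in for the 400,000-word GloVe vocabulary) are hypothetical names used only for illustration.

```python
def find_unknown_words(sentences, vocab):
    """Return the tokens (special markers included) that are missing from vocab.

    Hypothetical helper for illustration; not part of DL-text.
    """
    # Special tokens added around / inside every sentence.
    seen = {"$START$", "$END$", "$UNK$"}
    for sent in sentences:
        # Simple whitespace tokenization (an assumption for this sketch).
        seen.update(sent.lower().split())
    return sorted(word for word in seen if word not in vocab)


# Toy vocabulary standing in for the pre-trained GloVe vocabulary.
toy_vocab = {"this", "is", "a", "positive", "negative", "sentence",
             "yet", "another", "the", "last", "one"}

data = ['this is a positive sentence', 'this is a negative sentence',
        'yet another positve sentence', 'the last one is negative']

unknown = find_unknown_words(data, toy_vocab)
print("number of unknown words:", len(unknown))
print("some unknown words", unknown)
```

With this toy vocabulary the sketch reports the same four tokens as the log above: the three special tokens plus the misspelled positve. Correcting the typo to positive would leave only the special tokens.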

Apr 07 '18 15:04 GauravBh1010tt

I see, thank you very much!

Apr 07 '18 21:04 sunxx772

@sunxx772 @GauravBh1010tt If the explanation is satisfactory, can we close this one? 🎏

Apr 07 '18 23:04 adityac8
