Special Tokens in glove.twitter.27B
Hi,
Can you explain the meaning of the special tokens in glove.twitter.27B?
For example: `<repeat>`, `<allcaps>`, `<elong>`, ...
Well, there are over 1 million words in the vocabulary, including those. Whenever you scrape the web, all sorts of unusual Unicode sequences and other tokens show up. For the most part, you would use these word vectors by checking the terms that come up in your own application against the dictionary, so you can safely ignore most of the surprising combinations of symbols.
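A minimal sketch of that lookup pattern, assuming the plain-text format of the downloaded files (one word followed by its vector components per line; the filename below is just an example, adjust it to the dimensionality you downloaded):

```python
def load_vectors(path):
    """Parse one 'word v1 v2 ...' line per entry into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def lookup(vectors, term):
    """Return the vector for a term, or None if it is out of vocabulary.

    The Twitter vocabulary is lowercased, so lowercase the query too.
    """
    return vectors.get(term.lower())

# Example (filename is an assumption):
# vectors = load_vectors("glove.twitter.27B.25d.txt")
# vec = lookup(vectors, "hello")
```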
So there is no preprocessing before learning the word vectors? For example, replacing all user mentions with `<user>`?
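For context, special tokens like `<user>`, `<allcaps>`, and `<elong>` come from a preprocessing pass over the raw tweets before training. The rules below are a rough Python approximation of that idea; the exact rules used for glove.twitter.27B are in the Stanford preprocessing script, so treat these regexes as illustrative, not authoritative:

```python
import re

def preprocess_tweet(text):
    """Approximate Twitter-style preprocessing producing special tokens."""
    text = re.sub(r"https?://\S+", "<url>", text)          # URLs
    text = re.sub(r"@\w+", "<user>", text)                 # user mentions
    text = re.sub(r"#(\w+)", r"<hashtag> \1", text)        # hashtags
    text = re.sub(r"[-+]?[.\d]*\d+[:,.\d]*", "<number>", text)
    # Mark words written in ALL CAPS, then lowercase them.
    text = re.sub(r"\b[A-Z]{2,}\b",
                  lambda m: m.group(0).lower() + " <allcaps>", text)
    # Mark elongated words, e.g. "sooo" -> "so <elong>".
    text = re.sub(r"\b(\w+?)(\w)\2{2,}\b", r"\1\2 <elong>", text)
    # Mark repeated punctuation, e.g. "!!!" -> "! <repeat>".
    text = re.sub(r"([!?.])\1+", r"\1 <repeat>", text)
    return text.lower()
```

So the vectors were trained on text where mentions, URLs, numbers, etc. had already been collapsed into these placeholder tokens, which is why they appear in the vocabulary.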