magpie
Why not also remove stop-words in text processing?
```python
def get_all_words(self):
    """ Return all words tokenized, in lowercase and without punctuation """
    return [w.lower() for w in word_tokenize(self.text)
            if w not in string.punctuation]
```
I found that this function only removes punctuation from the text, but other kinds of low-information tokens, such as stop-words, are not removed.
e.g.:

```python
from nltk.corpus import stopwords

words = stopwords.words('english')
```
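For illustration, here is a minimal, self-contained sketch of what optional stop-word filtering could look like. It is a hypothetical variant, not magpie's actual code: it uses a small hardcoded stop-word subset in place of NLTK's full `stopwords.words('english')` list, and a naive split in place of `word_tokenize`, so it runs without NLTK downloads.

```python
import string

# Small illustrative subset; NLTK's stopwords.words('english')
# returns the full list of English stop-words.
STOP_WORDS = {"the", "a", "an", "in", "on", "of", "and", "is", "it", "to"}

def get_all_words(text, remove_stopwords=False):
    """Tokenize, lowercase, drop punctuation; optionally drop stop-words.

    A naive whitespace split plus punctuation stripping stands in
    for nltk.word_tokenize here.
    """
    tokens = [w.strip(string.punctuation).lower() for w in text.split()]
    tokens = [w for w in tokens if w]  # drop punctuation-only tokens
    if remove_stopwords:
        tokens = [w for w in tokens if w not in STOP_WORDS]
    return tokens

print(get_all_words("The cat sat on the mat.", remove_stopwords=True))
# → ['cat', 'sat', 'mat']
```

Whether to enable such a flag depends on the downstream model, which is exactly the trade-off discussed below.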
yeah, we want to leave the stopwords in for word2vec to work better.