
Why not remove more stop-words in text processing?

Open · JiaWenqi opened this issue 5 years ago · 1 comment

    def get_all_words(self):
        """ Return all words tokenized, in lowercase and without punctuation """
        return [w.lower() for w in word_tokenize(self.text) if w not in string.punctuation]

I found that this function removes only punctuation from the text. Other kinds of words, such as stop-words, are not removed. For example:

    from nltk.corpus import stopwords
    words = stopwords.words('english')

JiaWenqi · Mar 13 '19 07:03

Yeah, we want to leave the stop-words in so that word2vec works better.

jstypka · Mar 13 '19 09:03
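
For anyone who wants both behaviours, here is a rough sketch of how stop-word removal could be made optional, so the default still keeps stop-words for word2vec. It is not the project's code: `get_words`, its `tokenize` and `stop_words` parameters, and `SAMPLE_STOPWORDS` are hypothetical names made up for illustration, and the whitespace tokenizer stands in for `word_tokenize` so the snippet runs without downloading NLTK data.

```python
import string

# Tiny illustrative stop-word set (an assumption for this sketch). In practice
# the list from nltk.corpus.stopwords.words('english') could be passed instead.
SAMPLE_STOPWORDS = frozenset({"the", "a", "an", "is", "of", "and", "in", "to"})


def get_words(text, tokenize=str.split, stop_words=None):
    """Return lowercased tokens without punctuation.

    tokenize   -- callable mapping a string to a list of tokens; defaults to
                  whitespace splitting (nltk's word_tokenize could be used).
    stop_words -- optional collection of lowercase words to drop; None keeps
                  every word, matching the behaviour described in the issue.
    """
    # Same filter as in the issue: drop tokens that are punctuation.
    tokens = [w.lower() for w in tokenize(text) if w not in string.punctuation]
    if stop_words is None:
        return tokens
    # Optional extra step: drop stop-words after lowercasing.
    return [w for w in tokens if w not in stop_words]


sentence = "The cat is on the mat ."
all_words = get_words(sentence)
filtered = get_words(sentence, stop_words=SAMPLE_STOPWORDS)
```

Keeping the filter opt-in means the word2vec input is unchanged by default, while a caller who wants a smaller vocabulary can pass a stop-word collection explicitly.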