
What kind of preprocessing is required for the sentences?

Open pcg108 opened this issue 8 years ago • 1 comment

If I have a corpus of documents, each with multiple sentences, how should I preprocess these sentences so that when they are tokenized they yield useful tokens?

For example, should the words be lower-cased and stemmed, and should punctuation, digits, and stopwords be removed? Or is none of this necessary?

pcg108 · Nov 30 '16

None of that is necessary. As long as each sentence is a Python string, it should work. In fact, you should keep the stopwords and punctuation: everything in the sentence is used to determine where it ends up in the embedding space.

danielricks · Jan 03 '17
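
For illustration, here is a minimal sketch of what "no special preprocessing" looks like in practice. The `documents` list and the NLTK sentence splitter are placeholders, and the `load_model` / `Encoder` calls follow the interface shown in the skip-thoughts README; adapt the names to your own setup.

```python
# Minimal preprocessing sketch: split documents into sentences and pass them
# to the encoder as plain strings. No lower-casing, stemming, or stopword /
# punctuation removal -- the model uses the full sentence.
#
# Assumes NLTK for sentence splitting and the skipthoughts.load_model /
# Encoder interface from the skip-thoughts README.

import nltk
import skipthoughts

nltk.download('punkt', quiet=True)  # sentence tokenizer data

documents = [
    "The first document. It has two sentences.",
    "A second document with digits like 42 and punctuation!",
]

# Flatten every document into a list of raw sentence strings.
sentences = []
for doc in documents:
    sentences.extend(nltk.sent_tokenize(doc))

model = skipthoughts.load_model()
encoder = skipthoughts.Encoder(model)
vectors = encoder.encode(sentences)   # one embedding per sentence
print(vectors.shape)                  # (num_sentences, embedding_dim)
```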