skip-thoughts
What kind of preprocessing is required for the sentences?
If I have a corpus of documents, each with multiple sentences, how should I preprocess these sentences so that when they are tokenized they yield useful tokens?
For example, should the words be lower-cased and stemmed, and should punctuation, digits, and stopwords be removed? Or is none of this necessary?
None of that is necessary. As long as each sentence is a string in Python, it should work. In fact, you should keep stopwords and punctuation, because everything in the sentence is used to determine where it ends up in the embedding space.
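For concreteness, here is a minimal sketch assuming the ryankiros/skip-thoughts reference implementation (with its pretrained model files downloaded) and NLTK for sentence splitting; `documents` is a placeholder for your own corpus. The only preprocessing is splitting each document into sentences: the strings themselves are passed to the encoder untouched.

```python
import nltk
import skipthoughts  # ryankiros/skip-thoughts reference implementation

# Placeholder corpus; replace with your own documents.
documents = [
    "The cat sat on the mat. It purred loudly.",
    "Skip-thought vectors encode whole sentences.",
]

# Sentence splitting is the only preprocessing step: no lower-casing,
# stemming, or stopword/punctuation removal.
# Requires the 'punkt' tokenizer data: nltk.download('punkt')
sentences = []
for doc in documents:
    sentences.extend(nltk.sent_tokenize(doc))

# Encode the raw sentence strings directly, as in the project's README.
model = skipthoughts.load_model()
encoder = skipthoughts.Encoder(model)
vectors = encoder.encode(sentences)  # one embedding per sentence
```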