skip-thoughts
What kind of preprocessing is required for the sentences?
If I have a corpus of documents, each with multiple sentences, how should I preprocess these sentences so that when they are tokenized they yield useful tokens?
For example, should the words be lower-cased and stemmed, and should punctuation, digits, and stopwords be removed? Or is none of this necessary?
None of that is necessary. As long as each sentence is a string in Python, it should work. In fact, you should keep stopwords and punctuation, because everything in the sentence is used to determine where it ends up in the embedding space.
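For concreteness, here is a minimal sketch assuming the ryankiros/skip-thoughts reference implementation (with its pretrained model files downloaded) and NLTK for sentence splitting; `documents` is a placeholder for your own corpus. The only preprocessing is splitting each document into sentences: the strings themselves are passed to the encoder untouched.

```python
import nltk
import skipthoughts  # ryankiros/skip-thoughts reference implementation

# Placeholder corpus; replace with your own documents.
documents = [
    "The cat sat on the mat. It purred loudly.",
    "Skip-thought vectors encode whole sentences.",
]

# Sentence splitting is the only preprocessing step: no lower-casing,
# stemming, or stopword/punctuation removal.
# Requires the 'punkt' tokenizer data: nltk.download('punkt')
sentences = []
for doc in documents:
    sentences.extend(nltk.sent_tokenize(doc))

# Encode the raw sentence strings directly, as in the project's README.
model = skipthoughts.load_model()
encoder = skipthoughts.Encoder(model)
vectors = encoder.encode(sentences)  # one embedding per sentence
```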