What is the format of the training corpus?

Open marcusau opened this issue 5 years ago • 1 comments

Is it just one raw sentence per line, or is a specific format required?

Thanks a lot

Feb 18 '21 12:02 marcusau

A corpus is a collection of documents. If your documents contain multiple sentences, you can do one of the following, depending on the use case.

Given corpus: List[List[str]], you can:

flatten the corpus so that each sentence becomes its own item: [sentence for doc in corpus for sentence in doc]
or concatenate all sentences of each document into one document-level string: [' '.join(doc) for doc in corpus]
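
As a minimal sketch of the two options above, here they are applied to a small made-up corpus. The sample sentences and the variable names `sentences` and `documents` are assumptions for illustration only; which shape the training code expects is something to check against the library itself.

```python
from typing import List

# Hypothetical toy corpus: each document is a list of sentence strings.
corpus: List[List[str]] = [
    ["The cat sat.", "It purred."],
    ["Dogs bark."],
]

# Option 1: flatten, so every sentence becomes its own item.
sentences: List[str] = [sentence for doc in corpus for sentence in doc]

# Option 2: join each document's sentences into one string per document.
documents: List[str] = [" ".join(doc) for doc in corpus]

print(sentences)
print(documents)
```

Flattening gives one entry per sentence and discards document boundaries, while joining keeps one entry per document, so the choice depends on whether sentence-level or document-level context matters for your use case.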

Mar 03 '21 17:03 christeefy