Transformers-Domain-Adaptation

What is the format of the training corpus?

Open marcusau opened this issue 5 years ago • 1 comments

Is it just one raw sentence per line, or is a specific format required?

Thanks a lot

marcusau Feb 18 '21 12:02

A corpus is a collection of documents. If your document contains multiple sentences, depending on the use case, you may do one of the following.

Given corpus: List[List[str]], you can:

  • flatten the documents into a single list of sentences: [sentence for doc in corpus for sentence in doc]
  • or concatenate the sentences of each document into one document-level string: [' '.join(doc) for doc in corpus]
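
As a rough illustration of the two options above, here is a minimal sketch in plain Python. The toy corpus and the variable names `sentences` and `documents` are made up for the example and are not part of the library.

```python
from typing import List

# Hypothetical toy corpus: each document is a list of sentences.
corpus: List[List[str]] = [
    ["Aspirin reduces fever.", "It also thins the blood."],
    ["The patient was discharged."],
]

# Option 1: flatten, so every sentence becomes its own text entry.
sentences: List[str] = [sentence for doc in corpus for sentence in doc]

# Option 2: join the sentences of each document into one string per document.
documents: List[str] = [" ".join(doc) for doc in corpus]

print(sentences)
print(documents)
```

Which one fits probably depends on the use case: sentence-level entries are short, while document-level strings keep cross-sentence context but may be truncated if they exceed the tokenizer's maximum length.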

christeefy Mar 03 '21 17:03