Transformers-Domain-Adaptation
What is the format of the training corpus?
Is it just one raw sentence per line, or is a specific format required?
Thanks a lot
A corpus is a collection of documents. If your documents contain multiple sentences, you can do one of the following, depending on the use case.
Given corpus: List[List[str]], you can:
- flatmap each sentence in a document:
  [sentence for doc in corpus for sentence in doc]
- or concatenate all sentences into a document-level string:
  [' '.join(doc) for doc in corpus]
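For illustration, here is a minimal sketch of both options on a toy corpus. The helper names `flatten_sentences` and `join_documents` and the sample sentences are made up for this example and are not part of the library. Either result is a flat `List[str]`, which is the shape both expressions above produce.

```python
from typing import List


def flatten_sentences(corpus: List[List[str]]) -> List[str]:
    """Option 1: one training example per sentence."""
    return [sentence for doc in corpus for sentence in doc]


def join_documents(corpus: List[List[str]]) -> List[str]:
    """Option 2: one training example per document."""
    return [" ".join(doc) for doc in corpus]


# Hypothetical toy corpus: two documents, each a list of sentences.
corpus = [
    ["The patient was admitted.", "Vitals were stable."],
    ["Dosage was adjusted."],
]

sentence_level = flatten_sentences(corpus)  # 3 strings, one per sentence
document_level = join_documents(corpus)     # 2 strings, one per document
```

Sentence-level examples keep each input short, while document-level strings preserve context across sentences; which one fits better depends on your use case.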