SimCLS: If I want to create a new dataset (not CNN/DailyMail or XSum), what should I prepare for it?

Open lxianl455 opened this issue 3 years ago • 1 comment

I noticed that the unprocessed data should be in the following format: If I want to create a new dataset (other than CNN/DailyMail and XSum), what should I prepare for it? Should the file without the ".tokenized" suffix contain the original sentences? And which tokenizer should be used to tokenize the sentences in the file with the ".tokenized" suffix?

Dec 29 '21 09:12 lxianl455

Hi,

Should the file without the ".tokenized" suffix contain the original sentences? Yes.
Which tokenizer should be used to tokenize the sentences in the file with the ".tokenized" suffix? I used the PTBTokenizer from CoreNLP, but it should be fine to use another one. The tokenized data is only used for evaluation, so the choice of tokenizer does not affect training.
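
To make the two-file setup concrete, here is a minimal sketch in Python of writing an untokenized file and its ".tokenized" counterpart. Several things here are assumptions and not taken from the repository: the one-sentence-per-line layout, the helper names `simple_tokenize` and `write_pair`, and the regex tokenizer, which is only a rough stand-in for the PTBTokenizer and does not reproduce its output. Check the repository's own data format before relying on it.

```python
import re
from pathlib import Path


def simple_tokenize(text: str) -> str:
    """Split words from punctuation and rejoin them with single spaces.

    Hypothetical stand-in for CoreNLP's PTBTokenizer; it does not follow
    the PTB conventions (for example, it splits "don't" into "don ' t").
    """
    return " ".join(re.findall(r"\w+|[^\w\s]", text))


def write_pair(path: Path, sentences: list[str]) -> Path:
    """Write the original sentences to `path` and a tokenized copy next to it.

    Assumes one sentence per line, which is an illustrative layout only.
    Returns the path of the ".tokenized" file.
    """
    # Untokenized file: the original sentences, unchanged.
    path.write_text("\n".join(sentences) + "\n", encoding="utf-8")

    # Tokenized file: same name with the ".tokenized" suffix appended.
    tokenized_path = Path(str(path) + ".tokenized")
    tokenized_lines = [simple_tokenize(s) for s in sentences]
    tokenized_path.write_text("\n".join(tokenized_lines) + "\n", encoding="utf-8")
    return tokenized_path
```

Since the tokenized files are only used for evaluation, swapping in a different tokenizer should change the reported scores slightly at most, not the trained model.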

Let me know if you have more questions.

Jan 03 '22 20:01 yixinL7
