SimCLS

If I want to create a new dataset (not CNN/DailyMail or XSum), what should I prepare for it?

Open lxianl455 opened this issue 3 years ago • 1 comment

I noticed that the unprocessed data should be in the following format:

[image]

If I want to create a new dataset (not CNN/DailyMail or XSum), what should I prepare for it? Should the file without the ".tokenized" suffix contain the original sentences? And which tokenizer should be used to tokenize the sentences in the file with the ".tokenized" suffix?

lxianl455 Dec 29 '21 09:12

Hi,

  • Should the file without the ".tokenized" suffix contain the original sentences? Yes.

  • Which tokenizer should be used to tokenize the sentences in the file with the ".tokenized" suffix? I used the PTBTokenizer from CoreNLP, but another tokenizer should be fine. The tokenized data is only used for evaluation, so it does not affect training.
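To illustrate the two answers above, here is a minimal sketch of writing a raw file together with its ".tokenized" counterpart. It is not the preprocessing script from this repository. The helper names `simple_tokenize` and `write_pair`, the file name `demo.source`, and the one-example-per-line layout are all assumptions made for illustration, and the regex tokenizer is only a stand-in for the PTBTokenizer mentioned above.

```python
import re
from pathlib import Path
from typing import Callable, Iterable, List


def simple_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): separates words from punctuation.

    This is NOT the CoreNLP PTBTokenizer; swap in any tokenizer you prefer.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def write_pair(
    lines: Iterable[str],
    path: Path,
    tokenize: Callable[[str], List[str]] = simple_tokenize,
) -> Path:
    """Write the original text to `path` and a tokenized copy to `path` + ".tokenized".

    Assumes one example per line; internal whitespace is collapsed so that
    each example stays on a single line.
    """
    cleaned = [" ".join(line.split()) for line in lines]

    # Raw file: the original sentences, untouched apart from whitespace cleanup.
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")

    # Tokenized file: same examples, tokens joined by single spaces.
    tok_path = path.with_name(path.name + ".tokenized")
    tokenized = [" ".join(tokenize(line)) for line in cleaned]
    tok_path.write_text("\n".join(tokenized) + "\n", encoding="utf-8")
    return tok_path


if __name__ == "__main__":
    # Hypothetical file name, used only for this demonstration.
    out = write_pair(["Hello, world!", "A second   example."], Path("demo.source"))
    print(out.read_text(encoding="utf-8"))
```

Since the tokenized copy is only used for evaluation, the choice of tokenizer here matters for the reported scores rather than for training.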

Let me know if you have more questions.

yixinL7 Jan 03 '22 20:01