SimCLS
If I want to create a new dataset (not CNN/DailyMail or XSum), what should I prepare for it?
I noticed that the unprocessed data should be in the following format:
Should the file without the ".tokenized" suffix contain the original sentences?
And which tokenizer should be used to tokenize the sentences in the file with the ".tokenized" suffix?
Hi,
- Should the file without the ".tokenized" suffix contain the original sentences? Yes.
- Which tokenizer should be used to tokenize the sentences in the file with the ".tokenized" suffix? I used the PTBTokenizer from CoreNLP, but another tokenizer should work as well. The tokenized data is used only for evaluation, so the choice does not affect training.
Let me know if you have more questions.
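For readers preparing their own data, here is a minimal sketch of writing an untokenized file together with its ".tokenized" counterpart. It is only an illustration under stated assumptions: the file name `my_dataset.txt`, the helper names `simple_tokenize` and `write_pair`, and the one-sentence-per-line layout are all hypothetical, and the regex tokenizer is a simple stand-in, not the PTBTokenizer from CoreNLP mentioned above.

```python
import re
from pathlib import Path

# Stand-in tokenizer (an assumption, NOT the CoreNLP PTBTokenizer):
# keeps words (with an optional apostrophe part) together and splits
# every other non-space character into its own token.
_TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def simple_tokenize(line: str) -> str:
    """Return the line as space-separated tokens."""
    return " ".join(_TOKEN_RE.findall(line))


def write_pair(raw_path, lines):
    """Write the original lines to raw_path and a tokenized copy
    to raw_path + '.tokenized'. Returns the tokenized file's path."""
    raw_path = Path(raw_path)
    tok_path = raw_path.with_name(raw_path.name + ".tokenized")
    raw_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    tok_path.write_text(
        "\n".join(simple_tokenize(line) for line in lines) + "\n",
        encoding="utf-8",
    )
    return tok_path


if __name__ == "__main__":
    # Hypothetical file name, used only for illustration.
    write_pair("my_dataset.txt", ["Hello, world!", "It should be okay."])
```

Since the tokenized file is used only for evaluation, swapping `simple_tokenize` for another tokenizer would change the evaluation inputs but not the training data.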