
data preprocessing

Open xjw-star opened this issue 2 years ago • 3 comments

Hi Yixin, thank you for this fantastic work. I am reproducing the BRIO model and would like to understand the difference between the data and data.tokenized files, since there seems to be no code that distinguishes them.

xjw-star avatar Sep 27 '22 05:09 xjw-star

In fact, I want to preprocess the NYT dataset, but there is no off-the-shelf code for doing so.

xjw-star avatar Sep 27 '22 05:09 xjw-star

Hi, it's been a year, and I'd like to know how you ended up differentiating them, or what other solution you used.

mrasyadc avatar May 02 '23 00:05 mrasyadc

Sorry, I hadn't read it carefully enough, but I think "tokenized" means using the PTB tokenizer, right?

QUOTE -- We use the PTB tokenizer provided by Stanford CoreNLP (download here). Please note that tokenized texts are only used for evaluation. To tokenize a file, you may run (using test.source as an example)

export CLASSPATH=/your_path/stanford-corenlp-3.8.0.jar
cat test.source | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.source.tokenized

We have provided the example files in ./examples/raw_data. --QUOTE
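
For a dataset with several files (NYT, for example), the quoted command could be wrapped in a small loop. This is only a rough sketch, not code from the BRIO repo: the CORENLP_JAR variable and the two function names are made up for illustration, and it assumes java and the CoreNLP jar are already installed. The thread does not say which files need a tokenized copy, so the file names are passed in as arguments.

```shell
#!/usr/bin/env bash
# Sketch only: batch wrapper around the PTBTokenizer command quoted above.
# CORENLP_JAR, tokenized_name and tokenize_files are hypothetical names,
# not part of BRIO or Stanford CoreNLP.

# Output path for a given input, following the README's ".tokenized" suffix.
tokenized_name() {
  printf '%s.tokenized\n' "$1"
}

# Tokenize each file passed as an argument; skip paths that do not exist.
tokenize_files() {
  # Placeholder jar path copied from the README; override via CORENLP_JAR.
  export CLASSPATH="${CORENLP_JAR:-/your_path/stanford-corenlp-3.8.0.jar}"
  local src
  for src in "$@"; do
    if [ ! -f "$src" ]; then
      echo "skipping missing file: $src" >&2
      continue
    fi
    # Same tokenizer class and flags as in the README; the input file is
    # fed on stdin and the result is written next to it.
    java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines \
      < "$src" > "$(tokenized_name "$src")"
  done
}
```

After sourcing the script, running `tokenize_files test.source` should write test.source.tokenized, the same as the one-off command in the README. Since the README says tokenized texts are only used for evaluation, the untokenized files presumably still have to be kept alongside them.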

mrasyadc avatar May 02 '23 00:05 mrasyadc