BRIO data preprocessing

Open xjw-star opened this issue 2 years ago • 3 comments

Hi Yixin, thank you for this fantastic work. I am reproducing the BRIO model and would like to understand the difference between the data and data.tokenized files, since there seems to be no code that distinguishes them.

Sep 27 '22 05:09 xjw-star

In fact, I want to preprocess the NYT dataset, but there is no off-the-shelf code to do so.

Sep 27 '22 05:09 xjw-star

Hi, it's been a year and I'd like to know how you ended up differentiating them, or what other solution you used.

May 02 '23 00:05 mrasyadc

I'm sorry, I didn't read it carefully enough, but I think tokenized means using the PTB tokenizer, right?

QUOTE -- We use the PTB tokenizer provided by Stanford CoreNLP (download here). Please note that tokenized texts are only used for evaluation. To tokenize a file, you may run (using test.source as an example)

export CLASSPATH=/your_path/stanford-corenlp-3.8.0.jar
cat test.source | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.source.tokenized

We have provided the example files in ./examples/raw_data. --QUOTE
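
For anyone who needs to tokenize more than one file (for example when preparing a new dataset such as NYT), here is a minimal Python sketch that wraps the quoted command. It is not part of the BRIO repository. The helper names `tokenizer_command`, `tokenized_path`, and `tokenize_file` are hypothetical, the jar path is the placeholder from the quote and must be changed to your own installation, and the tokenizer flags are copied verbatim from the quoted README. It assumes `java` is on your PATH.

```python
import os
import subprocess
from pathlib import Path

# Assumption: placeholder path taken from the quoted README; point it at your own jar.
CORENLP_JAR = "/your_path/stanford-corenlp-3.8.0.jar"


def tokenizer_command():
    """Return the PTBTokenizer invocation, with flags copied from the quoted README."""
    return [
        "java",
        "edu.stanford.nlp.process.PTBTokenizer",
        "-ioFileList",
        "-preserveLines",
    ]


def tokenized_path(path):
    """Map e.g. test.source to test.source.tokenized (hypothetical helper)."""
    return Path(str(path) + ".tokenized")


def tokenize_file(path, jar=CORENLP_JAR):
    """Pipe one raw file through the tokenizer, like `cat file | java ... > file.tokenized`."""
    src = Path(path)
    dst = tokenized_path(src)
    # Equivalent of `export CLASSPATH=...`, but only for the child process.
    env = dict(os.environ, CLASSPATH=jar)
    with src.open("rb") as fin, dst.open("wb") as fout:
        subprocess.run(tokenizer_command(), stdin=fin, stdout=fout, env=env, check=True)
    return dst
```

Calling `tokenize_file("test.source")` should then write `test.source.tokenized` next to the input, and the same call can be repeated for each raw file. Since the quote says tokenized texts are only used for evaluation, the raw files would still be needed for training.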

May 02 '23 00:05 mrasyadc
