BRIO
data preprocessing
Hi Yixin, thank you for this fantastic work. I am reproducing the BRIO model and would like to understand the difference between the data and data.tokenized files, since there seems to be no code that distinguishes them.
In fact, I want to preprocess the NYT dataset, but there is no off-the-shelf code for doing so.
Hi, it's been a year, and I'd like to know how you ended up differentiating them, or what other solution you used.
I'm sorry, I didn't read it carefully enough, but I think "tokenized" means tokenized with the PTB tokenizer, right?
QUOTE -- We use the PTB tokenizer provided by Stanford CoreNLP (download here). Please note that tokenized texts are only used for evaluation. To tokenize a file, you may run (using test.source as an example):
export CLASSPATH=/your_path/stanford-corenlp-3.8.0.jar
cat test.source | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.source.tokenized
We have provided the example files in ./examples/raw_data. --QUOTE
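If you have several files to tokenize, the quoted command can be wrapped in a small script. This is only a rough sketch, not code from the BRIO repo: the helper names (tokenized_path, tokenize_file) are made up for illustration, the java flags are copied as-is from the README quote, and it assumes java is on your PATH and that you pass the CoreNLP jar path yourself.

```python
import os
import subprocess
import sys
from pathlib import Path

# Command copied from the README quote; the flags are not verified independently here.
PTB_COMMAND = [
    "java",
    "edu.stanford.nlp.process.PTBTokenizer",
    "-ioFileList",
    "-preserveLines",
]


def tokenized_path(raw_path):
    """Hypothetical helper: output path is the input path plus '.tokenized'."""
    return Path(str(raw_path) + ".tokenized")


def tokenize_file(raw_path, corenlp_jar):
    """Hypothetical helper: pipe raw_path through the PTB tokenizer.

    Equivalent to: cat raw_path | java ... > raw_path.tokenized
    with CLASSPATH pointing at the CoreNLP jar.
    """
    out_path = tokenized_path(raw_path)
    env = dict(os.environ, CLASSPATH=str(corenlp_jar))
    with open(raw_path, "rb") as src, open(out_path, "wb") as dst:
        subprocess.run(PTB_COMMAND, stdin=src, stdout=dst, env=env, check=True)
    return out_path


# Usage: python tokenize_files.py /your_path/stanford-corenlp-3.8.0.jar test.source ...
if __name__ == "__main__" and len(sys.argv) > 2:
    jar = sys.argv[1]
    for name in sys.argv[2:]:
        print("wrote", tokenize_file(name, jar))
```

Per the README quote, the .tokenized files are only needed for evaluation, so the raw files stay untouched and the tokenized copies are written next to them.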