berkeley-doc-summarizer Preparing the dataset

Open danbenn opened this issue 7 years ago • 1 comment

Your instructions mention:

To prepare the dataset, first you need to extract all the XML files from 2003-2007 and flatten them into a single directory

Is 2003-2007 referring to train_corefner_standoff or train_abstracts_standoff?

The files in each of these directories don't appear to be in XML format.

I'm not sure how to carry out that step...

Jun 26 '17 03:06 danbenn

Oh I see, it looks like you're referring to a NYT dataset that costs $300 for those who aren't members of the Linguistic Data Consortium. Yikes. My university canceled its membership unfortunately :(

EDIT: Whoops, just found the part that says:

The system is distributed with several pre-trained variants

As far as I can tell, I should be able to run one of the pre-trained summarizers without the NYT dataset.

Jun 26 '17 03:06 danbenn
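
For anyone who does have access to the NYT data, the "flatten them into a single directory" step quoted above could be sketched roughly as follows. This is only an illustration, not the project's own script: it assumes the archives have already been unpacked so that each year sits in its own subdirectory (2003, 2004, ...) under one root, and the function name `flatten_xml` and its parameters are made up for this example.

```python
import shutil
from pathlib import Path


def flatten_xml(src_root, dest_dir, years=range(2003, 2008)):
    """Copy every .xml file under src_root/<year>/ into one flat directory.

    Assumption: src_root contains one already-extracted subdirectory per
    year. Returns the number of files copied.
    """
    src_root = Path(src_root)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    copied = 0
    for year in years:
        year_dir = src_root / str(year)
        if not year_dir.is_dir():
            continue  # skip years that are not present under src_root

        # rglob descends through any nested month/day folders.
        for xml_path in sorted(year_dir.rglob("*.xml")):
            target = dest_dir / xml_path.name
            if target.exists():
                # Refuse to overwrite silently if two files share a name.
                raise FileExistsError(f"name collision: {target}")
            shutil.copy2(xml_path, target)
            copied += 1
    return copied
```

Calling something like `flatten_xml("nyt/data", "nyt_flat")` (both paths are placeholders) would leave the source tree untouched and put copies of the 2003-2007 XML files side by side in `nyt_flat`. Copying rather than moving is a deliberate choice here so the step can be rerun safely into a fresh directory.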
