download the dataset

Open YuMiaoTHU opened this issue 5 years ago • 19 comments

thanks for your excellent work!

When I run download-bbc-articles.py, it shows this: [image]

I want to know why. Thanks for your help~

YuMiaoTHU avatar Mar 25 '19 01:03 YuMiaoTHU

Maybe access to the Web Archive URLs is restricted! If the problem persists, drop me an email.

shashiongithub avatar Mar 25 '19 10:03 shashiongithub

Thanks for your reply! Sometimes it's hard for us to access some websites. I still can't get the dataset. Could you send me the raw or processed data via Google Drive or Dropbox? Thanks for your hard work!

YuMiaoTHU avatar Mar 25 '19 11:03 YuMiaoTHU

I have the same problem. The server is not stable. I downloaded about 2000 items the first time; when I reran the script, it could not download any more.

thinkwee avatar Apr 17 '19 04:04 thinkwee

Hi @shashiongithub, I am having similar issues downloading the data with the script. At the moment, I am working on a paper and would love to use the XSum dataset for my experiment. I was hoping you could share it with me through another channel. I tried contacting you by email, but the message could not be delivered to your mailbox. My email is [email protected] Thanks a lot!

joelowj avatar Oct 07 '19 14:10 joelowj

Here is the dataset:

http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz

Please use train, development and test ids from github to split into subsets. Let me know if you have any questions.
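The split step can be sketched as follows. This is a minimal sketch, not the repository's own code: it assumes the ids are available as a JSON object mapping split names to lists of id strings, and that the extracted files are named `{bbcid}.summary` as reported later in this thread. The function names and the JSON path are hypothetical.

```python
import json
from pathlib import Path


def load_split_ids(split_json_path):
    """Read a JSON file mapping split names to lists of BBC ids (assumed layout)."""
    with open(split_json_path, encoding="utf-8") as f:
        return json.load(f)


def group_files_by_split(data_dir, split_ids, suffix=".summary"):
    """Map each split name to the files in data_dir whose id belongs to it.

    Files whose id appears in no split are skipped.
    """
    id_to_split = {
        str(bbcid): split for split, ids in split_ids.items() for bbcid in ids
    }
    grouped = {split: [] for split in split_ids}
    for path in sorted(Path(data_dir).glob(f"*{suffix}")):
        split = id_to_split.get(path.name[: -len(suffix)])
        if split is not None:
            grouped[split].append(path)
    return grouped
```

Files that are in the archive but in none of the id lists are simply left out of every subset.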

shashiongithub avatar Nov 25 '19 16:11 shashiongithub

Hi, the link provided above is broken. Is there another way to get the dataset?

shahbazsyed avatar Apr 03 '20 07:04 shahbazsyed

Hey, I'm not able to open the link either. Can you please help?

mingzi151 avatar Apr 04 '20 02:04 mingzi151

http://kinloch.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

shashiongithub avatar Apr 04 '20 15:04 shashiongithub

Thanks!

shahbazsyed avatar Apr 06 '20 07:04 shahbazsyed

hey! link is broken :/ can you share updated one for me, so i can download the dataset..

fatihbeyhan avatar Apr 20 '20 12:04 fatihbeyhan

http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

shashiongithub avatar Apr 22 '20 09:04 shashiongithub

That url creates a dir called bbc-summary-data containing files like bbc-summary-data/{bbcid}.summary. Which code is meant to be run after that to continue preprocessing? bbcid.summary files are not mentioned in the README. Thanks!

sshleifer avatar May 22 '20 17:05 sshleifer

First file bbc-summary-data/10000983.summary looks like this: [image]

sshleifer avatar May 22 '20 18:05 sshleifer

A few things to keep in mind:

  1. There are some extra summary files here; you should ignore them
    (they have more than one sentence in their summary, etc.).

Please use the training/dev/test ids provided here to find which one to use: https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset

  2. There are a few mismatches (between the data here and the formats on GitHub):

bbc-summary-data/bbcid.summary --> xsum-extracts-from-downloads/bbcid.data

In each summary file:

[SN]URL[SN] => [XSUM]URL[XSUM]

[SN]TITLE[SN] => Ignore this, not used.

[SN]FIRST-SENTENCE[SN] => [XSUM]FIRST-SENTENCE[XSUM]

[SN]RESTBODY[SN] => [XSUM]RESTBODY[XSUM]

With these changes the preprocessing scripts should work.
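The renaming described above could be scripted roughly like this. It is a sketch under assumptions: the function names are hypothetical, files are assumed to be UTF-8, and the `[SN]TITLE[SN]` block is left in place untouched, since the comment only says it is not used and does not say whether it must be removed.

```python
from pathlib import Path

# Marker renames taken from the comment above; TITLE has no [XSUM] counterpart.
RENAMED_FIELDS = ("URL", "FIRST-SENTENCE", "RESTBODY")


def convert_markers(text):
    """Replace [SN]FIELD[SN] with [XSUM]FIELD[XSUM] for the fields that are used."""
    for field in RENAMED_FIELDS:
        text = text.replace(f"[SN]{field}[SN]", f"[XSUM]{field}[XSUM]")
    return text


def convert_directory(src_dir, dst_dir):
    """Write src_dir/<bbcid>.summary as dst_dir/<bbcid>.data with renamed markers.

    Returns the number of files converted.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src_dir).glob("*.summary")):
        converted = convert_markers(path.read_text(encoding="utf-8"))
        (dst / f"{path.stem}.data").write_text(converted, encoding="utf-8")
        count += 1
    return count
```

For example, `convert_directory("bbc-summary-data", "xsum-extracts-from-downloads")` would produce the directory layout named in the mapping above.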

shashiongithub avatar May 22 '20 19:05 shashiongithub

  1. Verifying that I don't need to run prepare_bbc_data.py after doing the SN --> XSUM replacement, right?

  2. Which field is the summary? Or is that in another file?

For context, I'm trying to replicate the results in the BART paper.

Thanks!

sshleifer avatar May 25 '20 17:05 sshleifer

Hello,

I am also unable to access any of the links posted in this thread. Could you please post a working URL?

Update: I found that the posted URL works if we run "wget http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz" from a terminal. It does not work from my browser.
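For anyone without wget, a non-browser download can also be sketched with Python's standard library. This is only an illustration: the `download` helper is hypothetical, the thread does not say why the browser fails, and whether the server still responds is not something this sketch can guarantee.

```python
import os
import shutil
import urllib.request


def download(url, dest_path, chunk_size=1 << 20):
    """Stream url to dest_path in chunks and return the size of the written file."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return os.path.getsize(dest_path)


# Example call (URL taken from this thread; performs a real network request):
# download(
#     "http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz",
#     "XSUM-EMNLP18-Summary-Data-Original.tar.gz",
# )
```

Streaming in chunks avoids holding the whole archive in memory.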

msadat3 avatar Apr 12 '21 02:04 msadat3

The same problem here. Could you please post a new URL?

StevenTang1998 avatar Aug 25 '21 01:08 StevenTang1998

Try: https://huggingface.co/datasets/xsum/resolve/main/data/XSUM-EMNLP18-Summary-Data-Original.tar.gz

anamtaamin avatar Apr 20 '23 06:04 anamtaamin

Using the downloaded dataset generates an error. Details follow.

I wrote the above-mentioned link (listed again below)

http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

into xsum.py:

_URL_DATA = "http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz"

Then I ran the following code:

from datasets import load_dataset
raw_datasets = load_dataset("xsum.py",  "raw_datasets")

It generates the following error:

ReadError: unexpected end of data

The above exception was the direct cause of the following exception:

File ~/miniconda3/envs/tf/lib/python3.10/site-packages/datasets/builder.py:1712, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1710 if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1711     e = e.__context__
-> 1712 raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1714 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

The dataset source may have a problem.

Notes:

However, the original code runs successfully:

from datasets import load_dataset
raw_datasets = load_dataset("xsum", split="train")
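"ReadError: unexpected end of data" is the message Python's tarfile module uses when an archive ends before it should, so one possible cause (an assumption, not confirmed in this thread) is an incomplete download. A minimal sketch for checking a local copy before handing it to a loader; the helper name `archive_is_complete` is hypothetical:

```python
import tarfile
import zlib


def archive_is_complete(path):
    """Return True if every regular file in the tar archive can be read to the end."""
    try:
        with tarfile.open(path, "r:*") as tar:
            for member in tar:
                if member.isfile():
                    with tar.extractfile(member) as f:
                        # Read and discard the data; truncation surfaces as an error.
                        while f.read(1 << 20):
                            pass
        return True
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        return False
```

If this returns False for the downloaded file, downloading it again is worth trying before suspecting the loading script.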

mikechen66 avatar Sep 22 '23 05:09 mikechen66