XSum download the dataset

download the dataset

Open YuMiaoTHU opened this issue 5 years ago • 19 comments

Thanks for your excellent work!

When I run download-bbc-articles.py, it showed that

I want to know why. Thanks for your help~

Mar 25 '19 01:03 YuMiaoTHU

Maybe access to the Web Archive URLs is restricted! If the problem persists, drop me an email.

Mar 25 '19 10:03 shashiongithub

Thanks for your reply! Sometimes it's hard for us to access some websites. I still can't get the dataset. Could you send me the raw or processed data via Google Drive or Dropbox? Thanks for your hard work!

Mar 25 '19 11:03 YuMiaoTHU

I have the same problem. The server is not stable. I downloaded about 2000 items the first time, but when I reran the scripts, nothing more could be downloaded.

Apr 17 '19 04:04 thinkwee

Hi @shashiongithub, I am having similar issues downloading the data with the script. At the moment, I am working on a paper and would love to use the XSum dataset for my experiments. I was hoping you could share it with me through another channel. I tried contacting you by email, but the message could not be delivered to your mailbox. My email is [email protected]. Thanks a lot!

Oct 07 '19 14:10 joelowj

Here is the dataset:

http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz

Please use the train, development and test IDs from GitHub to split the data into subsets. Let me know if you have any questions.
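The split step could be sketched roughly as below. This is a minimal sketch, not the repository's own code: it assumes the split file is a JSON object mapping split names to lists of BBC ids, and that the extracted archive holds one file per id named {bbcid}.summary. The function name `group_files_by_split` and any paths passed to it are hypothetical.

```python
import json
from pathlib import Path


def group_files_by_split(data_dir, split_file):
    """Map each split name to the extracted files whose id it lists.

    Assumption: split_file is JSON shaped like {"train": ["123", ...], ...}.
    Files whose id appears in no split are simply left out.
    """
    with open(split_file, encoding="utf-8") as f:
        split_ids = json.load(f)

    data_dir = Path(data_dir)
    grouped = {}
    for split_name, ids in split_ids.items():
        paths = []
        for bbcid in ids:
            path = data_dir / f"{bbcid}.summary"
            if path.exists():  # skip ids that are missing from the archive
                paths.append(path)
        grouped[split_name] = paths
    return grouped
```

Calling it with the extracted directory and the downloaded split file returns a dictionary keyed by split name, so extra files in the archive never reach any subset.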

Nov 25 '19 16:11 shashiongithub

Hi, the link provided above is broken. Is there another way to get the dataset?

Apr 03 '20 07:04 shahbazsyed

Hey, I'm not able to open the link either. Can you please help?

Apr 04 '20 02:04 mingzi151

http://kinloch.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

Apr 04 '20 15:04 shashiongithub

Thanks!

Apr 06 '20 07:04 shahbazsyed

Hey! The link is broken :/ Can you share an updated one so I can download the dataset?

Apr 20 '20 12:04 fatihbeyhan

http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

Apr 22 '20 09:04 shashiongithub

That url creates a dir called bbc-summary-data containing files like bbc-summary-data/{bbcid}.summary. Which code is meant to be run after that to continue preprocessing? bbcid.summary files are not mentioned in the README. Thanks!

May 22 '20 17:05 sshleifer

First file bbc-summary-data/10000983.summary looks like this:

May 22 '20 18:05 sshleifer

A few things to keep in mind:

There are some extra summary files here; you should ignore them
(for example, they have more than one sentence in their summary).

Please use the training/dev/test IDs provided here to find which ones to use: https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset

There are a few mismatches (between the data here and the formats on GitHub):

bbc-summary-data/bbcid.summary --> xsum-extracts-from-downloads/bbcid.data

In each summary file:

[SN]URL[SN] => [XSUM]URL[XSUM]
[SN]TITLE[SN] => Ignore this, not used.
[SN]FIRST-SENTENCE[SN] => [XSUM]FIRST-SENTENCE[XSUM]
[SN]RESTBODY[SN] => [XSUM]RESTBODY[XSUM]

With these changes the preprocessing scripts should work.
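The renames listed above could be applied with something like the following. It is only a sketch under stated assumptions: each marker is assumed to sit on its own line, and "Ignore this, not used" is read as "drop the title section", which is one possible interpretation. The function names and directory arguments are hypothetical, not part of the XSum repository.

```python
from pathlib import Path

# Marker renames taken from the mapping above.
RENAMED = {
    "[SN]URL[SN]": "[XSUM]URL[XSUM]",
    "[SN]FIRST-SENTENCE[SN]": "[XSUM]FIRST-SENTENCE[XSUM]",
    "[SN]RESTBODY[SN]": "[XSUM]RESTBODY[XSUM]",
}
# Assumption: the title section is dropped entirely.
TITLE = "[SN]TITLE[SN]"


def convert_summary_text(text):
    """Rename [SN] markers to [XSUM] markers and drop the title section."""
    out = []
    skipping = False
    for line in text.splitlines():
        marker = line.strip()
        if marker == TITLE:
            skipping = True  # skip lines until the next known marker
            continue
        if marker in RENAMED:
            skipping = False
            out.append(RENAMED[marker])
            continue
        if not skipping:
            out.append(line)
    return "\n".join(out) + "\n"


def convert_directory(src_dir, dst_dir):
    """Write a converted {bbcid}.data file for every {bbcid}.summary file."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(Path(src_dir).glob("*.summary")):
        text = src.read_text(encoding="utf-8")
        target = dst / (src.stem + ".data")
        target.write_text(convert_summary_text(text), encoding="utf-8")
        count += 1
    return count
```

For example, `convert_directory("bbc-summary-data", "xsum-extracts-from-downloads")` would mirror the directory rename mentioned above; it is worth checking a few output files by hand before running the preprocessing scripts.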

May 22 '20 19:05 shashiongithub

Just verifying: I don't need to run prepare_bbc_data.py after doing the SN --> XSUM replacement, right?
Which field is the summary? Or is that in another file?

For context, I'm trying to replicate the results in the BART paper.

Thanks!

May 25 '20 17:05 sshleifer

Hello,

I am also unable to access any of the links posted in this thread. Could you please post a working URL?

Update: I found that the posted URL works if you run "wget http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz" from a terminal. It does not work from my browser.

Apr 12 '21 02:04 msadat3

The same problem here. Could you please post a new URL?

Aug 25 '21 01:08 StevenTang1998

Replying to "hey! link is broken :/ can you share updated one for me, so i can download the dataset..":

Try https://huggingface.co/datasets/xsum/resolve/main/data/XSUM-EMNLP18-Summary-Data-Original.tar.gz

Apr 20 '23 06:04 anamtaamin

It generates an error if I use the downloaded dataset. The details are as follows.

I wrote the above-mentioned link (listed again below)

http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

into xsum.py:

_URL_DATA = "http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz"

and then ran the following code:

from datasets import load_dataset
raw_datasets = load_dataset("xsum.py",  "raw_datasets")

It generates the following error:

ReadError: unexpected end of data

The above exception was the direct cause of the following exception:

File ~/miniconda3/envs/tf/lib/python3.10/site-packages/datasets/builder.py:1712, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1710 if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1711     e = e.__context__
-> 1712 raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1714 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

The dataset source may have a problem.

Notes:

However, if I use the original code, it runs successfully.

from datasets import load_dataset
raw_datasets = load_dataset("xsum", split="train")
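An "unexpected end of data" error is what a truncated archive tends to produce, so an incomplete download is one possible cause here, though nothing in this thread confirms it. A minimal sketch for checking a local copy of the archive, using only the standard library (the helper name `archive_is_complete` is hypothetical):

```python
import tarfile


def archive_is_complete(path):
    """Return True if every member of the tar archive can be read to the end."""
    try:
        with tarfile.open(path, "r:*") as tar:
            for member in tar:
                if member.isfile():
                    f = tar.extractfile(member)
                    # Read the member in chunks; a truncated archive
                    # raises before the data runs out.
                    while f.read(1 << 20):
                        pass
        return True
    except (tarfile.TarError, EOFError, OSError):
        return False
```

If this returns False for the downloaded file, downloading it again (for instance with wget, as mentioned earlier in the thread) before pointing xsum.py at it would be a reasonable next step.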

Sep 22 '23 05:09 mikechen66
