XSum
download the dataset
Thanks for your excellent work!
When I run download-bbc-articles.py, it shows an error.
I want to know why; thanks for your help~
Maybe access to the Web Archive URLs is restricted! If the problem remains, drop me an email.
Thanks for your reply! Sometimes it's hard for us to access some websites. I still can't get the dataset; could you send me the raw data or the processed data via Google Drive or Dropbox? Thanks for your hard work!
I have the same problem. The server is not stable. I downloaded about 2000 articles the first time, but when I reran the scripts, they could not download any more.
Hi @shashiongithub, I am having similar issues downloading the data with the script. At the moment I am working on a paper and would love to use the XSum dataset for my experiments. I was hoping you could share it with me through other channels. I tried contacting you through your email but could not get the email to deliver to your mailbox. My email is [email protected]. Thanks a lot!
Here is the dataset:
http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz
Please use train, development and test ids from github to split into subsets. Let me know if you have any questions.
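To apply those splits, here is a minimal sketch that groups the extracted .summary files by id. It assumes the split ids have already been loaded from the JSON split file in the GitHub repo into a dict mapping split names to lists of bbcids; the demo ids and file layout below are fabricated for illustration.

```python
import tempfile
from pathlib import Path

def partition_summaries(data_dir, split_json):
    """Group *.summary files in data_dir by the train/dev/test id lists."""
    splits = {name: set(ids) for name, ids in split_json.items()}
    out = {name: [] for name in splits}
    for path in sorted(Path(data_dir).glob("*.summary")):
        for name, ids in splits.items():
            if path.stem in ids:  # filename (minus extension) is the bbcid
                out[name].append(path.name)
    return out

# Tiny self-contained demo with made-up ids (the real ids come from the
# split JSON in the repo, loaded with json.load):
with tempfile.TemporaryDirectory() as d:
    for bbcid in ("10000983", "20000001", "30000001"):
        (Path(d) / f"{bbcid}.summary").write_text("dummy")
    split_json = {"train": ["10000983"],
                  "validation": ["20000001"],
                  "test": ["30000001"]}
    parts = partition_summaries(d, split_json)
```

Any id in the archive that appears in none of the three lists is simply skipped, which matches the advice below about ignoring the extra summary files.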
Hi, the link provided above is broken. Is there another way to get the dataset?
Hey, I'm not able to open the link either. Can you please help?
http://kinloch.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz
Thanks!
Hey! The link is broken :/ Can you share an updated one so I can download the dataset?
http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz
That URL creates a directory called bbc-summary-data containing files like bbc-summary-data/{bbcid}.summary.
Which code is meant to be run after that to continue preprocessing? The bbcid.summary files are not mentioned in the README. Thanks!
The first file, bbc-summary-data/10000983.summary, looks like this:
A few things to keep in mind:
- There are some extra summary files here; you should ignore them (they have more than one sentence in their summary, etc.). Please use the training/dev/test ids provided here to find which ones to use: https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset
- There are a few mismatches between the data here and the formats on GitHub:
  bbc-summary-data/bbcid.summary --> xsum-extracts-from-downloads/bbcid.data
  In each summary file:
  [SN]URL[SN] => [XSUM]URL[XSUM]
  [SN]TITLE[SN] => Ignore this, not used.
  [SN]FIRST-SENTENCE[SN] => [XSUM]FIRST-SENTENCE[XSUM]
  [SN]RESTBODY[SN] => [XSUM]RESTBODY[XSUM]
With these changes the preprocessing scripts should work.
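The renaming and marker replacement described in those notes can be sketched as follows. This is only a sketch under the mapping above: the directory names follow the bbcid.summary --> bbcid.data convention in the notes, and the demo file contents are fabricated.

```python
import tempfile
from pathlib import Path

# Marker replacements from the notes above. [SN]TITLE[SN] is left untouched
# since it is ignored, not used.
MARKER_MAP = {
    "[SN]URL[SN]": "[XSUM]URL[XSUM]",
    "[SN]FIRST-SENTENCE[SN]": "[XSUM]FIRST-SENTENCE[XSUM]",
    "[SN]RESTBODY[SN]": "[XSUM]RESTBODY[XSUM]",
}

def convert(src_dir, dst_dir):
    """bbc-summary-data/bbcid.summary -> xsum-extracts-from-downloads/bbcid.data"""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.summary"):
        text = path.read_text(encoding="utf-8")
        for old, new in MARKER_MAP.items():
            text = text.replace(old, new)
        (dst / f"{path.stem}.data").write_text(text, encoding="utf-8")

# Self-contained demo on a fabricated summary file:
with tempfile.TemporaryDirectory() as d:
    src, dst = Path(d) / "bbc-summary-data", Path(d) / "xsum-extracts-from-downloads"
    src.mkdir()
    (src / "10000983.summary").write_text(
        "[SN]URL[SN]\nhttp://example.invalid\n\n"
        "[SN]FIRST-SENTENCE[SN]\nA one-sentence summary.\n\n"
        "[SN]RESTBODY[SN]\nThe article body.\n",
        encoding="utf-8",
    )
    convert(src, dst)
    converted = (dst / "10000983.data").read_text(encoding="utf-8")
```

After this, the .data files should be in the format the repo's preprocessing scripts expect.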
- Just verifying: I don't need to run prepare_bbc_data.py after doing the SN --> XSUM replacement, right?
- Which field is the summary? Or is that in another file?
For context, I'm trying to replicate the results in the BART paper.
Thanks!
Hello,
I am also unable to access any of the links posted in this thread. Could you please post a working URL?
Update: I found that the posted URL works if we do "wget http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz" from a terminal. It does not work from my browser.
The same problem here. Could you please post a new URL?
> hey! link is broken :/ can you share updated one for me, so i can download the dataset..

Try: https://huggingface.co/datasets/xsum/resolve/main/data/XSUM-EMNLP18-Summary-Data-Original.tar.gz
It generates an error if I use the downloaded dataset. Please see the details below.
I wrote the above-mentioned link (listed again below)
http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz
into xsum.py:
_URL_DATA = "http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz"
and then ran the following code:
from datasets import load_dataset
raw_datasets = load_dataset("xsum.py", "raw_datasets")
It generated the following error:
ReadError: unexpected end of data
The above exception was the direct cause of the following exception:

File ~/miniconda3/envs/tf/lib/python3.10/site-packages/datasets/builder.py:1712, in GeneratorBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1710 if isinstance(e, SchemaInferenceError) and e.context is not None:
   1711     e = e.context
-> 1712 raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1714 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)
DatasetGenerationError: An error occurred while generating the dataset
The dataset source may have a problem.
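"ReadError: unexpected end of data" usually means the .tar.gz on disk is truncated, e.g. the connection dropped mid-download, rather than a problem with the loading code. A quick local sanity check, sketched here on a fabricated archive, is to iterate the archive's members before pointing load_dataset at it:

```python
import os
import tarfile
import tempfile

def archive_is_complete(path):
    """Return True if every member header of the .tar.gz can be read to the end."""
    try:
        with tarfile.open(path, "r:gz") as tar:
            for _ in tar:  # walking all members raises on a truncated stream
                pass
        return True
    except (tarfile.ReadError, EOFError, OSError):
        return False

# Demo: build a small valid archive, then simulate an interrupted download
# by keeping only its first half.
with tempfile.TemporaryDirectory() as d:
    member = os.path.join(d, "10000983.summary")
    with open(member, "wb") as f:
        f.write(os.urandom(4096))  # incompressible payload, stands in for an article
    good = os.path.join(d, "good.tar.gz")
    with tarfile.open(good, "w:gz") as tar:
        tar.add(member, arcname="bbc-summary-data/10000983.summary")
    bad = os.path.join(d, "truncated.tar.gz")
    with open(good, "rb") as f:
        data = f.read()
    with open(bad, "wb") as f:
        f.write(data[: len(data) // 2])
    ok = archive_is_complete(good)
    broken = archive_is_complete(bad)
```

If the check fails on the real file, re-downloading it (e.g. with wget -c to resume) and clearing the datasets cache before retrying load_dataset is the likely fix.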
Note: if I use the original code, however, it runs successfully:
from datasets import load_dataset
raw_datasets = load_dataset("xsum", split="train")