Dataset

Open xiyan524 opened this issue 6 years ago • 19 comments

Thanks for your excellent work.

Would you mind providing the XSum dataset directly, just like the CNN/Daily Mail dataset we are all familiar with? I believe it would save time and be more convenient for experiments.

I'd appreciate any help you could give. Thanks~

xiyan524 avatar Feb 17 '19 01:02 xiyan524

Could you drop me an email and tell me what problems you are having with the download?

shashiongithub avatar Feb 17 '19 10:02 shashiongithub

Thanks a lot~ My email is [email protected]

Actually, I have not run into any problems yet, but I am pressed to run some experiments, so a direct download of the dataset would be more helpful. I will try to build the dataset myself when I have more time.

xiyan524 avatar Feb 17 '19 12:02 xiyan524

And I have a question about some parameters in your model. As I understand it, some parameters such as t_d (the topic distribution of the document D) are obtained from pre-trained LDA. I am curious whether this vector is updated during training. In other words, is the vector from pre-trained LDA just an initial value for training, or a fixed value that will not be changed? thx~

xiyan524 avatar Feb 17 '19 12:02 xiyan524

@xiyan524 Have you started training on Chinese data? I'm stuck and don't know where to start. As I understand it, the author said that using fastText or BERT would give better results, which should refer to the word embeddings. Where is the code where the author feeds in the word embeddings? I haven't found it 〒▽〒. I have already figured out how to generate Chinese word embeddings with fastText and BERT.

kedimomo avatar Feb 18 '19 03:02 kedimomo

@chenbaicheng Sorry, I haven't used the model proposed in the paper; I'm just interested in the XSum dataset.

xiyan524 avatar Feb 18 '19 07:02 xiyan524

@xiyan524 Thanks.

kedimomo avatar Feb 18 '19 07:02 kedimomo

"vector from pre-trained LDA is just a initial value for training or a fixed value which will not be changed? ..." Yes, pre-trained LDA vectors are fixed during training. It varies for different documents and for different words in every document.

shashiongithub avatar Feb 18 '19 21:02 shashiongithub
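
For anyone wondering what "fixed during training" means in code: the LDA topic vector is fed into the network as an ordinary input tensor rather than a learnable parameter, so gradients never update it. Below is a minimal, illustrative PyTorch sketch; the class, names and dimensions are made up for illustration and are not the actual XSum code.

import torch
import torch.nn as nn

class TopicConditionedEncoder(nn.Module):
    """Toy encoder that conditions word representations on a fixed topic vector."""
    def __init__(self, vocab_size, emb_dim, num_topics):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)          # trainable
        self.proj = nn.Linear(emb_dim + num_topics, emb_dim)    # trainable

    def forward(self, token_ids, topic_vec):
        # topic_vec comes from pre-trained LDA; detaching keeps it fixed even
        # if it ever arrives with requires_grad=True.
        topic_vec = topic_vec.detach()
        x = self.embed(token_ids)                               # (batch, seq, emb_dim)
        t = topic_vec.unsqueeze(1).expand(-1, x.size(1), -1)    # broadcast over the sequence
        return self.proj(torch.cat([x, t], dim=-1))

# The LDA distribution is a constant per-document input, not a model parameter.
enc = TopicConditionedEncoder(vocab_size=50000, emb_dim=256, num_topics=512)
tokens = torch.randint(0, 50000, (2, 10))
lda_topics = torch.rand(2, 512)            # pre-computed LDA document-topic vector
out = enc(tokens, lda_topics)              # gradients never flow into lda_topics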

@shashiongithub I got it. thx

xiyan524 avatar Feb 19 '19 01:02 xiyan524

Hello @shashiongithub I am also having trouble downloading the dataset. After rerunning the script more than 75 times, I still have 11 articles that cannot be downloaded. I would like to make a fair comparison with your results, using exactly the same train/test split.

To facilitate further research, experimentation, and development with this dataset, could you make it available directly?

artidoro avatar Nov 21 '19 16:11 artidoro

Here is the dataset:

http://kinloch.inf.ed.ac.uk/public/XSUM-EMNLP18-Summary-Data-Original.tar.gz

Please use the train, development and test IDs from the GitHub repo to split it into subsets. Let me know if you have any questions.

shashiongithub avatar Nov 25 '19 16:11 shashiongithub
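
In case it helps later readers, here is a rough sketch of doing that split in Python. It assumes the split file is the XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json from this repository and that its keys are the split names; please check the exact filename and keys against your checkout.

import json
import shutil
from pathlib import Path

# Placeholder paths; the split file ships with the XSum GitHub repository.
split_file = Path("XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json")
src_dir = Path("bbc-summary-data")       # unpacked *.summary files from the tar
out_root = Path("xsum-splits")

splits = json.loads(split_file.read_text())
for split_name, bbc_ids in splits.items():           # e.g. "train", "validation", "test"
    out_dir = out_root / split_name
    out_dir.mkdir(parents=True, exist_ok=True)
    for bbc_id in bbc_ids:
        src = src_dir / f"{bbc_id}.summary"
        if src.exists():                              # a few articles may have failed to download
            shutil.copy(src, out_dir / src.name)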

I downloaded the tar file above and it is in a different format than is expected for the script scripts/xsum-preprocessing-convs2s.py. Can you please share instructions for how to convert the data in the tar file to what this script expects? Thanks.

isabelcachola avatar Mar 09 '20 21:03 isabelcachola

For anyone who is trying to format the data in the link above, this is what I did to get it in the right format.

First, I used the following quick script to reformat the data:

from os import listdir
from os.path import isfile, join
import re
from tqdm import tqdm

# Unpacked *.summary files from the tar archive
bbc_dir = '/path/to/bbc-summary-data'
# Directory the XSum repo scripts expect, containing <bbcid>.data files
out_dir = '/path/to/XSum/XSum-Dataset/xsum-extracts-from-downloads'

bbc_files = [f for f in listdir(bbc_dir) if isfile(join(bbc_dir, f))]

for fname in tqdm(bbc_files):
    # Write <bbcid>.data with the [SN] section markers replaced by the
    # \[XSUM\] markers that the downstream preprocessing scripts look for.
    with open(join(out_dir, f'{fname.split(".")[0]}.data'), 'w') as f_out:
        text_in = open(join(bbc_dir, fname)).read()
        text_out = re.sub(r'\[SN\]', r'\[XSUM\]', text_in)
        f_out.write(text_out)

From here, you can follow the instructions in the dataset README, starting at the section "Postprocessing: Sentence Segmentation, Tokenization, Lemmatization and Final preparation".

As a side note, I am using a different version of the Stanford CoreNLP Toolkit (stanford-corenlp-full-2018-10-05), so I had to change this for loop in scripts/process-corenlp-xml-data.py to the following:

      # Route each sentence into the URL / summary / body buckets based on the
      # [XSUM] markers, which CoreNLP tokenizes as "-LSB- XSUM -RSB-".
      for doc_sent, doc_sentlemma in zip(doc_sentences, doc_sentlemmas):
        clean_doc_sent = re.sub(r'\\ ', '', doc_sent)
        if "-LSB- XSUM -RSB- URL -LSB- XSUM -RSB-" in clean_doc_sent:
          modeFlag = "URL"
          allcovered += 1
        elif "-LSB- XSUM -RSB- FIRST-SENTENCE -LSB- XSUM -RSB-" in clean_doc_sent:
          modeFlag = "INTRODUCTION"
          allcovered += 1
        elif "-LSB- XSUM -RSB- RESTBODY -LSB- XSUM -RSB-" in clean_doc_sent:
          modeFlag = "RestBody"
          allcovered += 1
        else:
          if modeFlag == "RestBody":
            restbodydata.append(doc_sent)
            restbodylemmadata.append(doc_sentlemma)
          if modeFlag == "INTRODUCTION":
            summarydata.append(doc_sent)

isabelcachola avatar Mar 10 '20 20:03 isabelcachola

(Quoting isabelcachola's reformatting script above.)

Hi! Thanks for providing the code. I'm wondering which text encoding you used when reading the files? When I use the same code as you provided, I get the following error:

`line 14, in
text_in = open(join(bbc_dir, fname)).read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 12896: illegal multibyte sequence`

matt9704 avatar Jul 08 '20 22:07 matt9704
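
That error usually means the files are being read with the platform's default codec (gbk on a Chinese-locale Windows) rather than UTF-8, which the BBC text appears to be encoded in. Passing an explicit encoding when opening the files should avoid it; here is a sketch of the same reformatting loop with explicit encodings (errors='replace' is only a defensive fallback):

from os import listdir
from os.path import isfile, join
import re
from tqdm import tqdm

bbc_dir = '/path/to/bbc-summary-data'
out_dir = '/path/to/XSum/XSum-Dataset/xsum-extracts-from-downloads'

for fname in tqdm([f for f in listdir(bbc_dir) if isfile(join(bbc_dir, f))]):
    # Explicit encodings stop Python from falling back to the locale default.
    with open(join(bbc_dir, fname), encoding='utf-8', errors='replace') as f_in, \
         open(join(out_dir, f'{fname.split(".")[0]}.data'), 'w', encoding='utf-8') as f_out:
        f_out.write(re.sub(r'\[SN\]', r'\[XSUM\]', f_in.read()))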

Hi, I can't access the link. Can you please fix it?

(Quoting the kinloch.inf.ed.ac.uk dataset link posted above.)

fajri91 avatar Jul 13 '20 11:07 fajri91

Shay suggested trying this: http://bollin.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz

shashiongithub avatar Jul 14 '20 09:07 shashiongithub

(Quoting the dataset link and split instructions posted above.)

Hello, thanks a lot for sharing the data. After downloading and unzipping it, the folder contains *.summary files, and each .summary file contains URL, TITLE, FIRST-SENTENCE and RESTBODY sections, which is different from the format expected in the README. What should I do next? Use the Stanford CoreNLP toolkit? It seems that xsum-preprocessing-convs2s requires two kinds of files (*.document and *.summary), which is different from the provided data.

Ricardokevins avatar Nov 01 '21 15:11 Ricardokevins

Hi, I just have one question: what is the total number of instances? I got 237002 after preprocessing the files downloaded from the bollin.inf.ed.ac.uk link above. Is it the same in your case? The number of instances reported on the Hugging Face website is around 226000.

sriram487 avatar Jan 11 '22 16:01 sriram487

(Quoting isabelcachola's reformatting script and CoreNLP loop above.)

Hello, I used process-corenlp-xml-data.py to process the bbcid.data.xml files, but I got an error which says some information is missing: /stanfordOutput/bbcid.data.xml

It would be great if you could help me with this issue. Thanks

sriram487 avatar Jan 16 '22 07:01 sriram487

If anyone still has problems with:

  1. downloading and splitting XSum
  2. evaluating fine-tuned BART on XSum

you might want to check my reproduction repository: https://github.com/BaohaoLiao/NLP-reproduction

BaohaoLiao avatar Mar 10 '23 19:03 BaohaoLiao