How to use dataset

Open • pyfisch opened this issue 5 years ago • 1 comment

Hi,

Thanks for providing the dataset as a download. I downloaded the dataset from the location mentioned in https://github.com/EdinburghNLP/XSum/issues/12#issuecomment-558241165. However, the format of the dataset appears to be different from the files you get if you download the data yourself.

See this gist: the first file, 12092740.data, I downloaded myself from archive.org, while the second file was part of the downloaded dataset.

As you can see, the file I downloaded myself contains the attributes [XSUM]URL[XSUM], [XSUM]INTRODUCTION[XSUM] and [XSUM]RESTBODY[XSUM]. But the file from the dataset has [SN]URL[SN], [SN]TITLE[SN], [SN]FIRST-SENTENCE[SN] and [SN]RESTBODY[SN].

My problem is that if I follow the tutorial at https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset, the scripts don't work with the unmodified files from the dataset.

Which changes do I need to make to the scripts?

Best, Pyfisch

pyfisch • Feb 06 '20 14:02

@pyfisch I had the same issue and was able to resolve it with a quick data processing script, described here. Hope this helps!
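For readers without access to the linked script, here is a rough sketch of what such a conversion might look like. It is not the script referred to above and not part of the XSum repository. It assumes that each marker sits on its own line, that [SN]FIRST-SENTENCE[SN] corresponds to [XSUM]INTRODUCTION[XSUM], and that [SN]TITLE[SN] can be dropped; none of this is confirmed in the thread. The names `parse_sections` and `sn_to_xsum` are made up for illustration.

```python
import re

# Matches a marker line such as "[SN]URL[SN]" or "[XSUM]RESTBODY[XSUM]".
MARKER = re.compile(r"^\[(SN|XSUM)\]([A-Z-]+)\[\1\]$")

# Assumed field mapping (not confirmed by the thread): the [SN] first
# sentence is treated as the [XSUM] introduction, and TITLE is dropped.
SN_TO_XSUM = {
    "URL": "URL",
    "FIRST-SENTENCE": "INTRODUCTION",
    "RESTBODY": "RESTBODY",
}


def parse_sections(text):
    """Split a .data file into {field name: content}, ignoring the tag prefix."""
    sections = {}
    current = None
    for line in text.splitlines():
        match = MARKER.match(line.strip())
        if match:
            current = match.group(2)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


def sn_to_xsum(text):
    """Rewrite an [SN]-style file in the [XSUM]-style layout."""
    sections = parse_sections(text)
    parts = []
    for sn_name, xsum_name in SN_TO_XSUM.items():
        if sn_name in sections:
            parts.append(f"[XSUM]{xsum_name}[XSUM]\n{sections[sn_name]}\n")
    return "\n".join(parts)
```

To use it, you would read each dataset file, pass its text to `sn_to_xsum`, and write the result to the directory the tutorial scripts read from. Check one converted file against a file you downloaded yourself before converting everything.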

isabelcachola • Mar 16 '20 22:03