WikiAsp: A Dataset for Multi-domain Aspect-based Summarization
This repository contains the dataset from the paper "WikiAsp: A Dataset for Multi-domain Aspect-based Summarization".
WikiAsp is a multi-domain, aspect-based summarization dataset in the encyclopedic domain. In this task, models are asked to summarize the cited reference documents of a Wikipedia article into aspect-based summaries. Each of the 20 domains includes 10 domain-specific, pre-defined aspects.

Dataset
Download
WikiAsp is available via 20 zipped archives, each of which corresponds to one domain. More than 28GB of storage space is necessary to download and store all the domains (unzipped). The following command will download and extract all of them:
```sh
./scripts/download_and_extract_all.sh /path/to/save_directory
```
Alternatively, one can individually download the archive for each domain from the table below; a hedged wget sketch follows the table. (Note: left-clicking will not prompt a download dialog. Open the link in a new tab, save it from the context menu, or use wget.)
| Domain | Link | Size (unzipped) |
|---|---|---|
| Album | Download | 2.3GB |
| Animal | Download | 589MB |
| Artist | Download | 2.2GB |
| Building | Download | 1.3GB |
| Company | Download | 1.9GB |
| EducationalInstitution | Download | 1.9GB |
| Event | Download | 900MB |
| Film | Download | 2.8GB |
| Group | Download | 1.2GB |
| HistoricPlace | Download | 303MB |
| Infrastructure | Download | 1.3GB |
| MeanOfTransportation | Download | 792MB |
| OfficeHolder | Download | 2.0GB |
| Plant | Download | 286MB |
| Single | Download | 1.5GB |
| SoccerPlayer | Download | 721MB |
| Software | Download | 1.3GB |
| TelevisionShow | Download | 1.1GB |
| Town | Download | 932MB |
| WrittenWork | Download | 1.8GB |
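As an illustration of the wget route, the sketch below downloads and extracts a single domain. Here `<archive-url>` is a placeholder for the link copied from the table, and the archive name and extraction tool are assumptions; match them to the file you actually receive:
```sh
# Illustrative sketch only: <archive-url> stands for the link copied
# from the table above; the file name is an assumption based on the domain.
wget -P /path/to/save_directory <archive-url>
# Extract with whatever tool matches the downloaded archive, e.g.:
unzip /path/to/save_directory/Album.zip -d /path/to/save_directory
```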
Format
Each domain includes three files, {train,valid,test}.jsonl, where each line represents one instance in JSON format.
Each instance has the following structure:
```json
{
    "exid": "train-1-1",
    "input": [
        "tokenized and uncased sentence_1 from document_1",
        "tokenized and uncased sentence_2 from document_1",
        "...",
        "tokenized and uncased sentence_i from document_j",
        "..."
    ],
    "targets": [
        ["a_1", "tokenized and uncased aspect-based summary for a_1"],
        ["a_2", "tokenized and uncased aspect-based summary for a_2"],
        "..."
    ]
}
```
where,
- exid: str
- input: List[str]
- targets: List[Tuple[str,str]]
Here, input holds the cited references as a flat list of sentences tokenized with NLTK.
The targets key points to a list of aspect-based summaries, where each element is a pair of a) the target aspect and b) the aspect-based summary for that aspect.
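For concreteness, here is a minimal Python sketch for iterating over one split; the file path is an assumption about where a domain was extracted:
```python
import json

def load_split(path):
    """Yield WikiAsp instances from a {train,valid,test}.jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Hypothetical path; adjust to wherever the Album domain was extracted.
for instance in load_split("Album/train.jsonl"):
    print(instance["exid"])                      # e.g. "train-1-1"
    sentences = instance["input"]                # List[str] of reference sentences
    for aspect, summary in instance["targets"]:  # each element is [aspect, summary]
        print(aspect, "->", summary[:80])
    break  # only inspect the first instance
```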
Inheriting from the base corpus, this dataset exhibits the following characteristics:
- Cited references are composed of multiple documents, but the document boundaries are lost, so they are expressed simply as a list of sentences.
- Sentences in the cited references (input) are tokenized using NLTK.
- The number of target summaries for each instance varies.
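Because both inputs and summaries come word-tokenized, code that displays them may want to detokenize first. A small sketch, assuming NLTK's Treebank detokenizer is an acceptable approximate inverse (the dataset does not prescribe one):
```python
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Approximately invert NLTK word tokenization for display purposes.
detok = TreebankWordDetokenizer()
tokens = "this is a tokenized , uncased sentence .".split()
print(detok.detokenize(tokens))  # -> "this is a tokenized, uncased sentence."
```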
Citation
If you use the dataset, please consider citing:
```bibtex
@article{hayashi20tacl,
  title   = {WikiAsp: A Dataset for Multi-domain Aspect-based Summarization},
  author  = {Hiroaki Hayashi and Prashant Budania and Peng Wang and Chris Ackerson and Raj Neervannan and Graham Neubig},
  journal = {Transactions of the Association for Computational Linguistics (TACL)},
  month   = {},
  url     = {https://arxiv.org/abs/2011.07832},
  year    = {2020}
}
```
LICENSE

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.