
Dataset for the paper: "A multi-task semi-supervised framework for Text2Graph & Graph2Text"

GTWiki

GTWiki is a non-parallel dataset for Text-To-Graph (parsing) & Graph-To-Text (generation) tasks. It is used in the framework implemented in our paper: "A multi-task semi-supervised framework for Text2Graph & Graph2Text".


Non-parallel data

GTWiki can be used for unsupervised learning. The text and graphs are collected for the same set of entities (176,000) from Wikipedia and Wikidata.

  • English text: 240,024 instances (one or more sentences each), with an average length of 459.67 characters.
  • Graphs: 271,095 instances (1 to 6 triples each).

The data are available at data/monolingual.txt and data/graphs.txt, respectively.
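
As a quick sanity check on the figures above, the following sketch counts the instances in a text file and computes their average length in characters. It assumes one instance per non-empty line, which is an assumption about the file layout rather than something documented here, and `corpus_stats` is a hypothetical helper name.

```python
from pathlib import Path


def corpus_stats(path):
    """Return (instance_count, average_length_in_characters) for a text file.

    Assumption: each non-empty line holds exactly one instance.
    """
    with open(path, encoding="utf-8") as handle:
        # Drop the trailing newline and skip blank lines.
        lines = [line.rstrip("\n") for line in handle if line.strip()]
    if not lines:
        return 0, 0.0
    average = sum(len(line) for line in lines) / len(lines)
    return len(lines), average


if __name__ == "__main__":
    text_path = Path("data/monolingual.txt")
    # Only report statistics when the file is actually present.
    if text_path.exists():
        count, average = corpus_stats(text_path)
        print(f"{count} instances, {average:.2f} characters on average")
```

If the files use a different layout (for example, several lines per instance), the line filter would need to be adapted accordingly.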

Collection

Alternatively, you can run our collection script and customize it for your needs:

python3 collect.py [WIKIDATA_ID] [WIKIPEDIA_NAME] [MAX_DEPTH]

For example:

python3 collect.py Q762 "Leonardo da Vinci" 1

This run collects both text and graphs for Leonardo da Vinci and for the child nodes of that entity in the graph.
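
To collect several entities in one go, the script can be called in a loop. The sketch below relies only on the command-line form shown above; the helper names `build_command` and `collect_all` are hypothetical and not part of the repository.

```python
import subprocess
import sys


def build_command(wikidata_id, wikipedia_name, max_depth, script="collect.py"):
    """Build the argument list for one collect.py invocation."""
    # sys.executable reuses the interpreter that runs this sketch.
    return [sys.executable, script, wikidata_id, wikipedia_name, str(max_depth)]


def collect_all(entities, max_depth=1):
    """Run collect.py once per (wikidata_id, wikipedia_name) pair."""
    for wikidata_id, wikipedia_name in entities:
        # check=True raises CalledProcessError if the script exits with an error.
        subprocess.run(
            build_command(wikidata_id, wikipedia_name, max_depth),
            check=True,
        )
```

For example, `collect_all([("Q762", "Leonardo da Vinci")])` issues the same single run as the command above. Passing the arguments as a list avoids shell quoting issues with names that contain spaces.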

For more information about the collection algorithm, please see our paper.

Requirements

The previous steps require Python >= 3.6. Install all requirements by running:

pip3 install -r requirements.txt

Citation

If you find our work, data, or code useful, please consider citing our paper.

@misc{domingo2022multitask,
      title={A multi-task semi-supervised framework for Text2Graph & Graph2Text}, 
      author={Oriol Domingo and Marta R. Costa-jussà and Carlos Escolano},
      year={2022},
      eprint={2202.06041},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}