A largely incomplete but hopefully useful list of links to datasets for relational learning and inductive logic programming. No guarantees on availability.

Relational Datasets

Classic ILP datasets

A list of datasets per source.

  • The CTU Prague Relational Dataset Repository: A large collection of ILP datasets, stored as MariaDB (SQL) databases.

    Motl, Jan, and Oliver Schulte. "The CTU Prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).

  • ACE data mining system data sets: nine ILP datasets in Quinlan's FOIL format, together with scripts to convert them into ACE format (see README.txt in the ZIP). These were used in:

    Jan Struyf, Jesse Davis, and David Page. "An efficient approximation to lookahead in relational learners." In J. Fürnkranz, T. Scheffer, and M. Spiliopoulou, editors, Machine Learning: ECML 2006, 17th European Conference on Machine Learning, Proceedings. Lecture Notes in Artificial Intelligence, volume 4212, pages 775-782. Springer, 2006.

    • Muta188
    • Muta230
    • Financial
    • Sisyphus A
    • Sisyphus B
    • UWCSE
    • Yeast
    • Carcinogenesis
    • Bongard
  • Alchemy

    • Animals
    • CiteSeer
    • Cora
    • Epinions
    • IMDB
    • Kinships
    • Nations
    • Protein Interaction
    • Radish Robot Mapping - Tutorial
    • UMLS
    • UW-CSE
    • WebKB
  • ILP Datasets: in SQL format

    • Carcinogenesis
    • Financial
    • Trains
    • Mutagenesis
    • Imdb
    • IMDB Top/Bottom Movies
  • Stephen Muggleton's data set directory:

    • Trains
    • alzheimers
    • carcinogenesis
    • chess
    • e_coli
    • mesh
    • more_chess
    • mutagenesis
    • proteins
    • satellite
    • suramin
    • utube
  • Sriraam's StARLinG Lab data sets:

    • Toy Father
    • Toy Cancer
    • IMDB
    • Cora
    • UW-CSE
    • WebKB
    • CiteSeer
    • Boston Housing
    • Drug-Drug Interactions
  • GILPS:

    • alzheimers
    • carcinogenesis
    • dsstox
    • metabolism
    • mutagenesis
    • pyrimidines
    • trains
  • BayesBase: Datasets posted in 3 formats: (i) as a MySQL dump for a relational schema, (ii) in the WILL format, similar to the Aleph ILP input format, (iii) in the .db format of Markov Logic Networks as implemented in the Alchemy system.

    • unielwin
    • Mutagenesis_std
    • MovieLens_std
    • MovieLens_TQ(1M)
    • Financial_std
    • Mondial_std
    • UW_std
    • imdb_MovieLens
    • Hepatitis_std
    • Cont_PLG_TM (Continuous database)
  • LINQS - Statistical Relational Learning Group

    • Social Spammer
    • Drug-Target Interaction
    • Stance Classification
    • CiteSeer for Document Classification
    • CiteSeer for Entity Resolution
    • Cora
    • ArXiv
    • PubMed Diabetes
    • WebKB
    • Terrorists
    • Terrorist Attacks
  • klog Datasets as Prolog files:

    • WebKB: Originally developed by M. Craven et al. (1998). The version available here is a direct conversion to Prolog of the data available at the Alchemy website.
    • Internet Movie Database: Data extracted from this database has been used in a number of relational learning papers. The version available here was downloaded from the IMDb website and converted into SQL using the procedure described in http://imdbpy.sourceforge.net/docs/README.sqldb.txt; finally, a subset of the tuples was converted into a Prolog file.
    • UW-CSE: The data set originally developed at the University of Washington for demonstrating the capabilities of Markov logic networks. The version available here is a direct conversion to Prolog of the data available at the Alchemy website.
    • Bursi: This data set contains 4,337 molecules labeled according to mutagenicity (2,401 mutagens and 1,936 nonmutagens). Originally developed by Kazius et al. (2005), it has been used in a number of machine learning papers, especially those studying graph kernels.
    • Biodegradability: This is an older data set of chemical structures containing 328 compounds labeled by their half-life for aerobic aqueous biodegradation (a regression task).
  • Weka Proper - RELAGGS

  • MLnet
    Includes some ILP datasets, among others. Note: the link points to a copy in the Internet Archive's Wayback Machine.
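
Several sources above distribute the same data in more than one format, for example the Alchemy .db format (BayesBase) and Prolog files converted from Alchemy data (klog). As a rough illustration of what such a conversion involves, here is a minimal Python sketch. It assumes the common case of one ground atom per line, `//` comments, and a leading `!` for false atoms; the function names `parse_db` and `to_prolog` and the sample facts are hypothetical and not taken from any of the listed repositories. Real files may use syntax this sketch does not handle (quoted strings containing commas, for instance).

```python
import re

# Assumed shape of a line: `pred(Const1, Const2)`, optionally prefixed by `!`.
# This covers only the simple case; it is not a full parser for the format.
ATOM_RE = re.compile(r"^(!?)\s*(\w+)\s*\(([^()]*)\)\s*\.?$")


def parse_db(text):
    """Return a list of (predicate, args, truth) tuples from .db-style text."""
    atoms = []
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop `//` comments
        if not line:
            continue
        m = ATOM_RE.match(line)
        if m is None:
            raise ValueError(f"unrecognised line: {raw!r}")
        neg, pred, arg_str = m.groups()
        args = tuple(a.strip() for a in arg_str.split(","))
        atoms.append((pred, args, not neg))
    return atoms


def to_prolog(pred, args):
    """Render one atom as a Prolog fact, quoting constants to preserve case."""
    def quote(const):
        # Inside a quoted Prolog atom, a single quote is escaped by doubling it.
        return "'" + const.replace("\\", "\\\\").replace("'", "''") + "'"

    # Prolog functors start with a lowercase letter (assumes pred starts with a letter).
    functor = pred[0].lower() + pred[1:]
    return f"{functor}({', '.join(quote(a) for a in args)})."


# Hypothetical toy facts, not taken from any listed dataset.
SAMPLE = """\
// toy example
advisedBy(Person1, Person2)
!professor(Person1)
"""

atoms = parse_db(SAMPLE)
# Keep only the true atoms; how to represent negatives is a design choice.
facts = [to_prolog(pred, args) for pred, args, truth in atoms if truth]
print("\n".join(facts))
```

Quoting every constant is a deliberate choice here: capitalised constants would otherwise be read as variables by a Prolog system.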

Other links: