list-extractor
list-extractor copied to clipboard
Extract Data from Wikipedia Lists
List-extractor - Extract Data from Wikipedia Lists
List-Extractor is a tool that can extract information from wikipedia lists and form appropriate RDF triples from the list data.
GSoC'16 Detailed Progress available here
Final commit of GSoC'16 can be found here
GSoC'17 Work's detailed progress available here
List-Extractor wiki available here
GSoC'17 Final results and challenges available here
How to run the tools
This project contains 2 differnt tools: List-Extractor and Rules-Generator.
Use rulesGenerator.py first to generate desired rules, and then use listExtractor.py to extract triples for wiki resources.
Alternatively, you can use only listExtractor.py and extract with existing default settings.
For more details, refer to the documentation present in the docs folder. The sample generated datasets can be found here. Some example triples for different domains are present in extracted folder.
List-Extractor:
python listExtractor.py [collect_mode] [source] [language] [-c class_name]
-
collect_mode:sora- use
sto specify a single resource orafor a class of resources in the next parameter.
- use
-
source: a string representing a class of resources from DBpedia ontology (find supported domains below), or a single Wikipedia page of an actor/writer. -
language:en,it,deetc. (for now, available only for some languages, for selected domains)- a two-letter prefix corresponding to the desired language of Wikipedia pages and SPARQL endpoint to be queried.
-
-c --classname: a string representing classnames you want to associate your resource with. Applicable only forcollect_mode="s".
NOTE: While extracting triples from multiple resources in a domain (collect_mode = a), using Ctrl + C will skip the current resource and move on to the next resource. To quit the extractor, use Ctrl + \.
Examples:
python listExtractor.py a Writer itpython listExtractor.py s William_Gibson en: Uses the default inbuilt mapper-functionspython listExtractor.py s William_Gibson en -c CUSTOM_WRITER: Uses theCUSTOM_WRITERmapping only to extract list elements.
If successful, a .ttl file containing RDF statements about the specified source is created inside a subdirectory called extracted.
Rules-Generator:
python rulesGenerator.py
- This is an interactive tool, select the options given in the menu for using the rules generator.
- While creating new mapping rules or mapper functions, make sure to follow the required format as suggested by the tool.
- Upon successful addition/modification, it will update the
settings.jsonandcustom_mapper.jsonso that the new user defined rules/functions can run with extractor.
Default Mapped Domains:
-
English (
en):- Person:
Writer,Actor,MusicalArtist,Athelete,Polititcian,Manager,Coach,Celebrityetc. - EducationalInstitution:
University,School,College,Library - PeriodicalLiterature:
Magazines,Newspapers,AcademicJournals - Group:
Band
- Person:
-
Other (
it,de,es):Writer,Actor,MusicalArtist
-
More Domains can be added using the
rulesGenerator.pytool.
Attributions for 3rd party tools:
This project uses 2 other existing open source projects.
- JSONpedia, a framework designed to simplify access at MediaWiki contents transforming everything into JSON. Such framework provides a library, a REST service and CLI tools to parse, convert, enrich and store WikiText documents.
The software is copyright of Michele Mostarda ([email protected]) and released under Apache 2 License. Link : JSONpedia
- JCommander, a very small Java framework that makes it trivial to parse command line parameters.
Contact Cédric Beust ([email protected]) for more information. Released under Apache 2 License. Link : JCommander
Requirements
- Python 2.7
- RDFlib library
- Stable internet connection