extraction-framework
multiple Wikipedia templates handling for mapping to class
submitted by Marco Fossati in [1], [2]
Description
Currently, the first infobox template on a Wikipedia article defines the DBpedia type of that article, while any further infobox templates are extracted as instances of their corresponding types, each with its own URI. Hence, the current behavior is to build new URIs when one wiki article contains multiple templates. However, this may lead to the creation of URIs with double underscores (something like "blank nodes"). The problem is significant and is likely to affect all other chapters. The objective is to identify and implement extraction strategies that would cover all (or most) cases in Wikipedia.
Strategy ideas
Intuitively, when multiple templates occur in the same wiki article, the extractor should generate a single entity (i.e. subject) and assign all the mapped types and properties to it, no matter where the templates are in the wiki article. This is not a robust strategy: it does fix some errors, but may introduce others.
Another idea is to declare class disjointness, i.e. owl:disjointWith axioms, in the DBpedia ontology and raise an error when two templates map to disjoint classes.
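The two ideas above can be combined in a rough sketch: merge all mapped types onto the page subject, but refuse to do so when any two templates map to disjoint classes. This is only an illustration in Python, not the extraction framework's code; the function name, the exception, the example URI and the class names are all hypothetical, and the disjoint pairs stand in for owl:disjointWith axioms that would really come from the ontology.

```python
from itertools import combinations


class DisjointTemplatesError(ValueError):
    """Raised when two templates on one page map to disjoint classes."""


def merge_template_types(page_uri, template_classes, disjoint_pairs):
    """Return (subject URI, sorted list of types) for one wiki page.

    template_classes: dict of template name -> mapped class (hypothetical names).
    disjoint_pairs: iterable of 2-tuples standing in for owl:disjointWith axioms.
    """
    # owl:disjointWith is symmetric, so each pair is stored without order.
    disjoint = {frozenset(pair) for pair in disjoint_pairs}
    for (t1, c1), (t2, c2) in combinations(template_classes.items(), 2):
        if frozenset((c1, c2)) in disjoint:
            raise DisjointTemplatesError(
                f"{t1!r} ({c1}) and {t2!r} ({c2}) map to disjoint classes"
            )
    # No conflict found: attach every mapped type to the single page subject.
    return page_uri, sorted(set(template_classes.values()))


# Hypothetical mappings for two templates that describe the same person.
subject, types = merge_template_types(
    "http://example.org/resource/Alfredo_Binda",
    {"sportivo": "Athlete", "bio": "Person"},
    disjoint_pairs=[("Person", "Comic")],
)
```

Raising an error keeps the decision with the mapping editor. Whether a conflicting page should instead fall back to separate URIs is left open here.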
Examples
- http://it.wikipedia.org/w/index.php?title=Diabolik_(fumetto)&action=edit -> 'personaggio' and 'fumetto e animazione' templates. 'personaggio' is about the fictional character, 'fumetto e animazione' is about the comic books. In this case, it wouldn't make sense to attach all properties to one subject URI. We need two different valid URIs.
- http://it.wikipedia.org/w/index.php?title=Alfredo_Binda&action=edit -> 'sportivo' and 'bio' templates. It's pretty clear that both templates are about the person, and it would make a lot of sense to attach all extracted properties to the same subject URI.
- http://es.wikipedia.org/w/index.php?title=Jacques_Chirac&action=edit -> 'Ficha de autoridad' and 'Ficha de criminal' templates. In this case, it would be nice to use the same subject URI for both templates, but the info from 'Ficha de criminal' seems far less important.
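To make the first two examples concrete, here is a small Python sketch of one possible policy: a template shares the page URI unless its class is disjoint with a class already attached to that URI, in which case it gets a URI of its own. Everything in it is an assumption made for illustration: the function name, the class names, the disjointness pair, the example.org URIs and the `__<n>` suffix format (the issue only says that the extra URIs contain double underscores).

```python
def assign_subjects(page_uri, templates, disjoint_pairs):
    """Map each template name to the subject URI its properties would use.

    templates: list of (template name, mapped class) in page order.
    disjoint_pairs: 2-tuples standing in for owl:disjointWith axioms.
    """
    disjoint = {frozenset(pair) for pair in disjoint_pairs}
    subjects = {}
    main_classes = []  # classes already attached to the page URI
    extra = 0          # counter for separate URIs (suffix format is hypothetical)
    for name, cls in templates:
        if any(frozenset((cls, other)) in disjoint for other in main_classes):
            # Conflicts with the main subject: give this template its own URI.
            extra += 1
            subjects[name] = f"{page_uri}__{extra}"
        else:
            # Compatible: attach its type and properties to the page URI.
            main_classes.append(cls)
            subjects[name] = page_uri
    return subjects


# Hypothetical axiom: a fictional character is never a comic.
DISJOINT = [("FictionalCharacter", "Comic")]

diabolik = assign_subjects(
    "http://example.org/resource/Diabolik",
    [("personaggio", "FictionalCharacter"), ("fumetto e animazione", "Comic")],
    DISJOINT,
)
binda = assign_subjects(
    "http://example.org/resource/Alfredo_Binda",
    [("sportivo", "Athlete"), ("bio", "Person")],
    DISJOINT,
)
```

Under these assumptions the two Diabolik templates end up on different URIs, while both Alfredo Binda templates share one. The Jacques Chirac case is not covered: nothing here ranks one template as less important than another.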
Links
[1] https://sourceforge.net/mailarchive/message.php?msg_id=30369047
[2] https://sourceforge.net/mailarchive/message.php?msg_id=29907224
Regarding "invalid URIs with double underscores" - these are not invalid, just maybe harder to use and understand in some cases.
You are right, I just copied Marco's text. I changed it.
Cool!
Don't we already have something here? https://github.com/dbpedia/extraction-framework/pull/4
Not exactly. The problem here is that they want to use the second+ template as the mapping for the main resource.