extraction-framework

Handling multiple Wikipedia templates when mapping to a class

Open jimkont opened this issue 11 years ago • 5 comments

submitted by Marco Fossati in [1], [2]

Description

Up to now, the first infobox template on a Wikipedia article defines the DBpedia type of that article, while each further infobox template is extracted as an instance of its corresponding type with its own URI. In other words, the current behavior is to build new URIs when one wiki article contains multiple templates. However, this may lead to the creation of URIs with double underscores (something like "blank nodes"). The problem is significant and is likely to affect all other chapters. The objective is to identify and implement extraction strategies that cover all (or most) cases in Wikipedia.

Strategy ideas

Intuitively, when multiple templates occur in the same wiki article, the extractor should generate a single entity (i.e. subject) and assign all the mapped types and properties to it, no matter where the templates appear in the article. This is not a robust strategy: it fixes some errors but may introduce others.
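To make the difference concrete, here is a minimal Python sketch (not the framework's actual code, which is written in Scala) that contrasts the current behavior with the single-subject idea. The `Template` class, the `ex:` class and property names, the property values, and the `<page>__<n>` naming scheme for extra URIs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

BASE = "http://dbpedia.org/resource/"


@dataclass
class Template:
    """Hypothetical stand-in for a mapped infobox template."""
    name: str
    mapped_class: str
    properties: dict = field(default_factory=dict)


def extract_separate(page, templates):
    """Current behavior as described: the first template types the page URI,
    every further template gets its own URI (assumed form: <page>__<n>)."""
    triples = []
    for i, tpl in enumerate(templates):
        subject = BASE + page if i == 0 else f"{BASE}{page}__{i}"
        triples.append((subject, "rdf:type", tpl.mapped_class))
        for prop, value in tpl.properties.items():
            triples.append((subject, prop, value))
    return triples


def extract_merged(page, templates):
    """Proposed idea: one subject receives all mapped types and properties."""
    subject = BASE + page
    triples = []
    for tpl in templates:
        triples.append((subject, "rdf:type", tpl.mapped_class))
        for prop, value in tpl.properties.items():
            triples.append((subject, prop, value))
    return triples


if __name__ == "__main__":
    # Placeholder values; only the template names come from the issue.
    templates = [
        Template("sportivo", "ex:Athlete", {"ex:propA": "value-a"}),
        Template("bio", "ex:Person", {"ex:propB": "value-b"}),
    ]
    for triple in extract_separate("Alfredo_Binda", templates):
        print("separate:", triple)
    for triple in extract_merged("Alfredo_Binda", templates):
        print("merged:  ", triple)
```

With merging, both templates contribute to one subject URI, which suits pages where every template describes the same thing, but it would wrongly conflate entities on pages where the templates describe different things.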

Another idea is to declare class disjointness, i.e. owl:disjointWith axioms, in the DBpedia ontology and raise an error when two templates map to disjoint classes.
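A rough Python sketch of such a disjointness check follows. The class hierarchy in `PARENT` and the single axiom in `DISJOINT` are hypothetical and not taken from the DBpedia ontology. The sketch assumes that disjointness declared on a class also applies to its subclasses, so it compares the ancestor chains of each pair of mapped classes.

```python
from itertools import combinations

# Hypothetical subclass -> superclass links (not the real DBpedia ontology).
PARENT = {
    "ex:FictionalCharacter": "ex:Agent",
    "ex:Athlete": "ex:Person",
    "ex:Person": "ex:Agent",
    "ex:Comic": "ex:Work",
}

# Hypothetical owl:disjointWith axioms, stored as unordered pairs.
DISJOINT = {frozenset({"ex:Agent", "ex:Work"})}


def ancestors(cls):
    """Return cls followed by all of its superclasses."""
    chain = [cls]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def disjoint_conflicts(classes):
    """Return every pair of mapped classes that falls under a disjointness axiom."""
    conflicts = []
    for a, b in combinations(classes, 2):
        if any(frozenset({x, y}) in DISJOINT
               for x in ancestors(a) for y in ancestors(b)):
            conflicts.append((a, b))
    return conflicts


if __name__ == "__main__":
    # Templates mapped to disjoint classes: the extractor could raise an error here.
    print(disjoint_conflicts(["ex:FictionalCharacter", "ex:Comic"]))
    # Compatible classes: no conflict reported.
    print(disjoint_conflicts(["ex:Athlete", "ex:Person"]))
```

A caller could raise an error whenever the returned list is non-empty, as suggested above.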

Examples

  • http://it.wikipedia.org/w/index.php?title=Diabolik_(fumetto)&action=edit -> 'personaggio' and 'fumetto e animazione' templates. 'personaggio' is about the fictional character, 'fumetto e animazione' is about the comic books. In this case, it wouldn't make sense to attach all properties to one subject URI. We need two different valid URIs.
  • http://it.wikipedia.org/w/index.php?title=Alfredo_Binda&action=edit -> 'sportivo' and 'bio' templates. It's pretty clear that both templates are about the person, and it would make a lot of sense to attach all extracted properties to the same subject URI.
  • http://es.wikipedia.org/w/index.php?title=Jacques_Chirac&action=edit -> 'Ficha de autoridad' and 'Ficha de criminal' templates. In this case, it would be nice to use the same subject URI for both templates, but the info from 'Ficha de criminal' seems far less important.

Links

[1] https://sourceforge.net/mailarchive/message.php?msg_id=30369047
[2] https://sourceforge.net/mailarchive/message.php?msg_id=29907224

Mar 18 '13 14:03 jimkont

Regarding "invalid URIs with double underscores" - these are not invalid, just maybe harder to use and understand in some cases.

Mar 18 '13 14:03 jcsahnwaldt

You are right, I just copied Marco's text. I have changed it.

Mar 18 '13 15:03 jimkont

Cool!

Mar 18 '13 15:03 jcsahnwaldt

Don't we already have something for this here? https://github.com/dbpedia/extraction-framework/pull/4

Mar 25 '13 17:03 ninniuz

Not exactly. The problem here is that they want to use the second (or a later) template as the mapping for the main resource.

Mar 25 '13 23:03 jimkont