
enable several classes per entity

Open VladimirAlexiev opened this issue 10 years ago • 12 comments

Hamid Ghofrani wrote: For Elvis_Presley, the DBpedia types are just http://dbpedia.org/ontology/Agent, http://dbpedia.org/ontology/MilitaryPerson and http://dbpedia.org/ontology/Person

Wikipedia has this: https://en.wikipedia.org/w/index.php?title=Elvis_Presley&action=edit

{{Infobox person
| occupation   = Singer, actor
| module = {{Infobox military person
| module2 = {{Infobox musical artist
  | instrument   = Vocals, guitar, piano
   | background   = solo_singer
   | genre        = {{flat list|
*[[Rock and roll]]
*[[Pop music|pop]]
  ...

The newest extraction is here: http://mappings.dbpedia.org/server/extraction/en/extract?title=Elvis_Presley&revid=&format=turtle-triples&extractors=custom

Unfortunately DBpedia processes only the first two infoboxes (Person and Military person) but not Musical artist. It even skips the instrument, background and genre fields from the third infobox (Musical artist). Gerard Kuys has remarked that DBpedia picks only one leaf class "to avoid contradictions". I can understand that various infoboxes scattered throughout the article could contribute nonsensical classes, especially if they have nonsense mappings like Mapping_el:Quote_box.

However:

  • The above two "modules" are not randomly scattered, they are embedded in the main infobox template
  • How is "contradiction" defined? The subclasses of Person are definitely not disjoint; there are numerous examples.
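
To make the nesting concrete, here is a minimal sketch (not the framework's parser) that walks wikitext and reports each template with its nesting depth. The sample is a shortened, closed-off stand-in for the Elvis infobox, and `template_names` is a hypothetical helper. It ignores `{{{parameter}}}` triple braces and other wikitext corner cases.

```python
# Shortened, hypothetical stand-in for the infobox quoted above.
SAMPLE = """{{Infobox person
| occupation = Singer, actor
| module = {{Infobox military person}}
| module2 = {{Infobox musical artist
  | instrument = Vocals, guitar, piano
  | background = solo_singer
  }}
}}"""


def template_names(wikitext):
    """Return (depth, name) for every {{...}} template, in document order."""
    found = []
    depth = 0
    i = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            # The template name runs up to the first '|', '}' or newline.
            j = i + 2
            while j < len(wikitext) and wikitext[j] not in "|}\n":
                j += 1
            found.append((depth, wikitext[i + 2:j].strip()))
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
        else:
            i += 1
    return found


for depth, name in template_names(SAMPLE):
    print("  " * depth + name)
```

An extractor that only looks at depth 0 sees just "Infobox person"; the two modules sit at depth 1.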

Feb 18 '15 13:02 VladimirAlexiev

@jimkont said: One problem is that we do not process embedded templates (Infobox musical artist), which is mainly a design issue. I don't know who made that decision in the past. It is quite easy to change, but I am not sure of the implications of such a change. (Currently it extracts neither MilitaryPerson nor MusicalArtist: both are nested.)

Sometimes it helps to look at the state of the article at the time of extraction: http://en.wikipedia.org/w/index.php?title=Elvis_Presley&action=edit&oldid=606258011 DBpedia assigns a single type to each resource and creates separate resources for subsequent mapped templates if they are not direct subclasses/superclasses of the first mapped template. In this case we had an Infobox person followed by an Infobox military person (not nested).
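
A rough sketch of that rule as described here, not the actual code in TemplateMapping.scala. The `PARENT` table is a tiny hand-written slice of the class hierarchy, `decide` is a hypothetical helper, and "direct subclass/superclass" is read as "on the same ancestor chain", which is an assumption. The sketch only says whether a later template stays on the main resource or gets a separate one; it does not model which type ends up on the main resource.

```python
# Hand-written slice of the ontology for illustration: child -> parent.
PARENT = {
    "Person": "Agent",
    "MilitaryPerson": "Person",
    "Artist": "Person",
    "MusicalArtist": "Artist",
}


def ancestors(cls):
    """All superclasses of cls, nearest first."""
    chain = []
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain


def related(a, b):
    """True if a and b are equal or one is an ancestor of the other."""
    return a == b or a in ancestors(b) or b in ancestors(a)


def decide(mapped_classes):
    """For the mapped templates of one page (in page order), say where each goes."""
    decisions = []
    for n, cls in enumerate(mapped_classes):
        if n == 0:
            decisions.append((cls, "main resource"))
        elif related(cls, mapped_classes[0]):
            decisions.append((cls, "same resource"))
        else:
            decisions.append((cls, "separate resource"))
    return decisions


print(decide(["Person", "MilitaryPerson"]))
print(decide(["Person", "Sound"]))
```

Under these assumptions MilitaryPerson stays on the Elvis resource, while an unrelated class such as Sound would be split off.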

Feb 19 '15 16:02 VladimirAlexiev

The same problem "do not process embedded templates" causes https://github.com/dbpedia/mappings-tracker/issues/46

Feb 19 '15 16:02 VladimirAlexiev

http://sourceforge.net/p/dbpedia/mailman/message/32867924/

@jimkont said: it is very trivial to change but needs testing... any volunteers from the community? I can provide an adapted version of the code and also dumps but someone needs to look at the data

Sure: Boyan can deploy it locally, and I’ll look at the data. Gimme test cases, so far I got:

  • Film_date (https://github.com/dbpedia/mappings-tracker/issues/46)
  • Elvis should be Person, MilitaryPerson and MusicalArtist (https://github.com/dbpedia/extraction-framework/issues/341)

Is the logic "pick one out of several disjoint classes" documented precisely somewhere? And use cases/test cases? @jcsahnwaldt?

I don't know but I have an uneasy feeling about such logic. If templateA says classA and templateB says classB, seems to me the extractor itself can't make an intelligent decision to drop one of them.

  • Either the maps are correct and both classes should be emitted (and what an ontologist thought are disjoint classes, the data proves are not)
  • Or a map needs to be fixed (eg Listen is not a class but an IntermediateNodeMapping with class Sound and relation soundRecording)
  • Or a template is wrongly applied (eg in bgwiki, "Musical Artist" was mis-applied to "BG at the World Cup 1994")

Don't see room for Artificial Intelligence here ;-)

Feb 24 '15 07:02 VladimirAlexiev

I don't think this feature is properly documented or tested, but the comments are pretty good:

https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/TemplateMapping.scala#L43

Feb 24 '15 12:02 jcsahnwaldt

It might seem reasonable to ascribe the types of all infoboxes to the main resource, but one prominent counterexample given in the comments is https://en.wikipedia.org/wiki/Volkswagen_Golf - lots of infoboxes describing specific Golf models. Attaching all their data to the main resource wouldn't be useful.

Feb 24 '15 12:02 jcsahnwaldt

The code that extracts all templates from a page is in this branch https://github.com/jimkont/extraction-framework/tree/multi-template-mapping

I ran some experiments on Dutch; the diffs are here: https://www.dropbox.com/sh/3gjfrou29lmgxad/AAB41KYHJyTSCu9jnbu4LmjVa?dl=0

triple stats:

   8361422 nlwiki-20141209-instance-types.ttl.all
   8469742 nlwiki-20141209-instance-types.ttl.top
  12694261 nlwiki-20141209-mappingbased-properties.ttl.all
  12717493 nlwiki-20141209-mappingbased-properties.ttl.top
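
For a quick sense of scale, the differences between the two variants can be worked out as below. This is plain arithmetic on the four counts above; the dictionary layout and the `deltas` helper are just for illustration, and nothing here says what `.all` and `.top` stand for.

```python
# Triple counts copied from the stats above.
COUNTS = {
    "instance-types": {"all": 8361422, "top": 8469742},
    "mappingbased-properties": {"all": 12694261, "top": 12717493},
}


def deltas(counts):
    """Return {dataset: top - all} for each dataset."""
    return {name: c["top"] - c["all"] for name, c in counts.items()}


for name, delta in deltas(COUNTS).items():
    share = delta / COUNTS[name]["all"]
    print(f"{name}: top - all = {delta:+d} ({share:+.2%})")
```

The gap is 108320 triples for instance types and 23232 for mapping-based properties.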

From a superficial look, it mostly adds types to untyped resources because of the following mapping: http://mappings.dbpedia.org/index.php/Mapping_nl:Bronvermelding_anderstalige_Wikipedia These types look wrong, but it needs further investigation whether removing this mapping improves things.

It would be nice to test this in other languages and English.

Feb 24 '15 22:02 jimkont

I see, that uses correspondingProperty & correspondingClass.

  • (I didn't know correspondingProperty can be used outside of IntermediateNodeMapping)
  • I expanded the description with another example: http://mappings.dbpedia.org/index.php/Template:TemplateMapping#Example3:_Multiple_of_the_Same_Template
  • And I just couldn't resist writing this: http://mappings.dbpedia.org/index.php/Volkswagen_Golf_Jokes: read it if you have some free time :-)

@jcsahnwaldt When there's an explicit correspondingClass, the "pick one out of several disjoint classes" logic does not apply. But I'll read those source comments...

Feb 25 '15 02:02 VladimirAlexiev

@jimkont "Bronvermelding anderstalige Wikipedia" means "Sources in other-language Wikipedias", eg

* {{Bronvermelding anderstalige Wikipedia|taal=de|titel=Archimedes|datum=20140414}}
* {{Bronvermelding anderstalige Wikipedia|taal=en|titel=Archimedes|datum=20140414}}

So these are stale (non-Wikidata) Interlanguage links. Quick killing is recommended.

@boyan-simeonov: can you please install https://github.com/jimkont/extraction-framework/tree/multi-template-mapping locally so I can test it?

Feb 25 '15 02:02 VladimirAlexiev

  • removed http://mappings.dbpedia.org/index.php/Mapping_nl:Bronvermelding_anderstalige_Wikipedia
  • added to https://github.com/dbpedia/extraction-framework/blob/master/server/src/main/statistics/ignorelist_nl.txt
  • asked the Dutch: https://nl.wikipedia.org/wiki/Wikipedia:Wikiproject/Check_Wikipedia#Delete_template_Bronvermelding_anderstalige_Wikipedia

Feb 25 '15 02:02 VladimirAlexiev

@roland-c can you check the Dutch diffs for possible errors?

Feb 25 '15 07:02 jimkont

The {{Bronvermelding anderstalige Wikipedia}} template should not be read as being an interlanguage link. It is there to comply with the CC-BY-SA license of the source material. It's probably more appropriate to compare it to, say, {{Cite web}}. I don't see any mappings for that, so it's probably inappropriate to have one for Bronvermelding_anderstalige_Wikipedia.

Feb 25 '15 09:02 frankgeerlings

The diffs are full of errors (dbpo:Article) because of http://mappings.dbpedia.org/index.php/Mapping_nl:Bronvermelding_anderstalige_Wikipedia, which is now removed. A new diff including Elvis would really help to verify correct results.

Feb 25 '15 11:02 roland-c
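
One way such a diff could be checked, as a sketch only: compare two instance-types dumps as sets of triples and count the newly added types per class, so that a flood of one class (like dbpo:Article here) stands out. The sample lines and resource names are made up, `parse` and `added_types` are hypothetical helpers, and the parser only handles simple one-line triples.

```python
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
DBO = "http://dbpedia.org/ontology/"
RES = "http://nl.dbpedia.org/resource/"

# Made-up sample data standing in for the "before" and "after" dumps.
BEFORE = [
    f"<{RES}Example_A> {RDF_TYPE} <{DBO}Person> .",
]
AFTER = [
    f"<{RES}Example_A> {RDF_TYPE} <{DBO}Person> .",
    f"<{RES}Example_A> {RDF_TYPE} <{DBO}MusicalArtist> .",
    f"<{RES}Example_B> {RDF_TYPE} <{DBO}Article> .",
    f"<{RES}Example_C> {RDF_TYPE} <{DBO}Article> .",
]


def parse(lines):
    """Split simple one-line triples into a set of (s, p, o) tuples."""
    triples = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        s, p, o = line.rstrip(" .").split(" ", 2)
        triples.add((s, p, o))
    return triples


def added_types(before, after):
    """Count rdf:type objects that appear only in the 'after' dump."""
    added = parse(after) - parse(before)
    return Counter(o for s, p, o in added if p == RDF_TYPE)


for cls, n in added_types(BEFORE, AFTER).most_common():
    print(n, cls)
```

With real dumps, the two lists would be replaced by the opened files.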