extraction-framework

Estimated Class & Property usage statistics

Open jimkont opened this issue 9 years ago • 3 comments

In the mapping server we already provide statistics about the template & template property mapping coverage.

A great addition would be an estimated class & property instance count, based on the template statistics and the existing mappings.

e.g. the English Persondata template

  • http://mappings.dbpedia.org/index.php/Mapping_en:Persondata
  • http://mappings.dbpedia.org/server/statistics/en/
  • http://mappings.dbpedia.org/server/templatestatistics/en/?template=Persondata

is mapped to class Person and has 1,190,939 instances in English Wikipedia, which in theory will provide 1,190,939 instances of the class Person. For the same template, the template property NAME has 1,176,765 occurrences and is mapped to foaf:name, indicating 1,176,765 instances of foaf:name in the resulting data.

Of course, aggregating all these numbers will not be accurate, because the Persondata template might be in the same page with the Infobox person template. Nevertheless, it will give us an overview of each class and property usage and a way to clean up the ontology of unused properties.
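A minimal offline sketch of the estimate described above, assuming the template statistics and the mappings have already been exported into plain dictionaries. The dictionary layout and the name `estimate_usage` are hypothetical and not part of the extraction framework or the mapping server; the only figures used are the Persondata numbers quoted in this issue.

```python
from collections import Counter

# Assumed input shape: template -> (instance count, {template property -> count}).
# The numbers are the English Persondata figures quoted in this issue.
template_stats = {
    "Persondata": (1_190_939, {"NAME": 1_176_765}),
}

# Assumed input shape: template -> (ontology class, {template property -> ontology property}).
mappings = {
    "Persondata": ("Person", {"NAME": "foaf:name"}),
}


def estimate_usage(template_stats, mappings):
    """Sum template and template-property counts per mapped class and property.

    The result is an upper-bound style estimate: it ignores parse failures,
    multi-valued extractions and pages that carry several mapped templates.
    """
    class_counts = Counter()
    property_counts = Counter()
    for template, (ontology_class, property_map) in mappings.items():
        if template not in template_stats:
            continue  # mapped template with no usage statistics
        instances, property_stats = template_stats[template]
        class_counts[ontology_class] += instances
        for template_property, ontology_property in property_map.items():
            property_counts[ontology_property] += property_stats.get(template_property, 0)
    return class_counts, property_counts


class_counts, property_counts = estimate_usage(template_stats, mappings)
```

Classes and properties that end up with a zero or very low count would then be candidates for a closer look when cleaning up the ontology.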

related to: http://wiki.dbpedia.org/gsoc2015/ideas#h460-10

jimkont avatar Mar 06 '15 11:03 jimkont

As a GSoC warm-up task, this does not need to be fully integrated in the server module; even an offline generation should be sufficient.

jimkont avatar Mar 06 '15 11:03 jimkont

I would like to work on this! Roughly, what you need is to compute property usage statistics per template, right? I don't understand what you mean by: "aggregating all these numbers will not be accurate because the Persondata template might be in the same page with Infobox Person". Why is that the case?

jvican avatar Mar 29 '15 12:03 jvican

Perfect @jvican! Exactly, something like how many occurrences dbo:birthDate could possibly have, based on all the existing mappings & template statistics and, if possible, per language. Similarly for classes.

The reason this number will not be accurate is that 1) not all dates can be correctly parsed, and some might extract multiple values, and 2) the Infobox person & Persondata templates may hold the exact same value for birth date, which should count as one value instead of two. The second case affects the class statistics more, where in many cases we would count 2+ instances instead of 1.
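To illustrate the second point, here is a small sketch comparing the naive sum with a count of distinct pages. It assumes per-page template occurrences are available, which the aggregated template statistics alone do not provide; the page names and the function `class_estimate_bounds` are made up for illustration.

```python
# Hypothetical data: which pages use which template mapped to class Person.
pages_by_template = {
    "Persondata": {"Page_A", "Page_B", "Page_C"},
    "Infobox person": {"Page_B", "Page_C", "Page_D"},
}


def class_estimate_bounds(pages_by_template, templates):
    """Return (summed, distinct) page counts for templates mapped to one class.

    summed   adds the per-template counts, so a page with two templates counts twice.
    distinct counts each page once, no matter how many mapped templates it holds.
    """
    summed = sum(len(pages_by_template[t]) for t in templates)
    distinct = len(set().union(*(pages_by_template[t] for t in templates)))
    return summed, distinct


summed, distinct = class_estimate_bounds(
    pages_by_template, ["Persondata", "Infobox person"]
)
```

With this toy data the summed estimate is 6 while only 4 distinct pages exist, which is the kind of gap to expect between the estimate and the real class instance count.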

Either way, this is not meant to be accurate but to give a usage estimate, to see which properties / classes can be deleted / merged / rearranged.

jimkont avatar Mar 29 '15 13:03 jimkont