extraction-framework
Estimated Class & Property usage statistics
In the mapping server we already provide statistics about the template & template property mapping coverage.
A great addition would be an estimated class & property instance count based on the template statistics and the existing mappings.
For example, the English Persondata template
- http://mappings.dbpedia.org/index.php/Mapping_en:Persondata
- http://mappings.dbpedia.org/server/statistics/en/
- http://mappings.dbpedia.org/server/templatestatistics/en/?template=Persondata
is mapped to the class Person and has 1,190,939 instances in the English Wikipedia, which in theory will provide 1,190,939 instances of the class Person.
For the same template, the template property NAME has 1,176,765 occurrences and is mapped to foaf:name, indicating 1,176,765 instances of foaf:name in the resulting data.
Of course, aggregating all these numbers will not be accurate, because the Persondata template might be on the same page as the Infobox person template. Nevertheless, it will give us an overview of each class and property usage, and a way to clean up the ontology by removing unused properties.
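The estimate described above could be sketched offline roughly as follows. This is only a minimal sketch, not the mapping server's implementation: the dictionaries `template_counts`, `property_counts`, `class_mappings` and `property_mappings` and the function `estimate_usage` are hypothetical names, and the only figures used are the Persondata numbers quoted above.

```python
from collections import Counter

# Hypothetical inputs: in practice these would be read from the template
# statistics and the mappings. The figures are the ones quoted above.
template_counts = {"Persondata": 1_190_939}
property_counts = {("Persondata", "NAME"): 1_176_765}
class_mappings = {"Persondata": "Person"}
property_mappings = {("Persondata", "NAME"): "foaf:name"}


def estimate_usage(template_counts, property_counts,
                   class_mappings, property_mappings):
    """Sum template statistics per mapped class and per mapped property.

    These are rough estimates: a page carrying several mapped templates
    is counted once per template, so the sums can overcount.
    """
    class_estimates = Counter()
    for template, count in template_counts.items():
        mapped_class = class_mappings.get(template)
        if mapped_class is not None:  # skip unmapped templates
            class_estimates[mapped_class] += count

    property_estimates = Counter()
    for key, count in property_counts.items():
        mapped_property = property_mappings.get(key)
        if mapped_property is not None:  # skip unmapped template properties
            property_estimates[mapped_property] += count

    return class_estimates, property_estimates


classes, properties = estimate_usage(
    template_counts, property_counts, class_mappings, property_mappings
)
print(classes["Person"])        # 1190939
print(properties["foaf:name"])  # 1176765
```

Ontology classes or properties that never show up in the result would be candidates for clean-up.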
related to: http://wiki.dbpedia.org/gsoc2015/ideas#h460-10
As a GSoC warm-up task, this does not need to be fully integrated into the server module; even an offline generation should be sufficient.
I would like to work on this! Roughly, what you need is to compute property usage statistics per template, right? I don't understand what you mean by: "aggregating all these numbers will not be accurate because the Persondata template might be in the same page with Infobox Person". Why is that the case?
Perfect, @jvican!
Exactly, something like how many occurrences dbo:birthDate could possibly have based on all the existing mappings & template statistics, and if possible per language. Similarly for classes.
The reason this number will not be accurate is that 1) not all dates can be parsed correctly, and some might yield multiple values, and 2) the Infobox person & Persondata templates may hold the exact same value for the birth date, which should count as one value instead of two. The second case affects the class statistics more, where in many cases we would count 2+ instances instead of 1.
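One way to make the double counting visible is to report a range instead of a single number. The sketch below is only an illustration: `class_instance_bounds` is a hypothetical helper, it assumes each template occurs at most once per page, and the Infobox person count is a made-up placeholder, since only the Persondata figure is quoted in this thread.

```python
def class_instance_bounds(template_counts):
    """Return (lower, upper) bounds on the distinct pages typed with one class.

    Assumes each template occurs at most once per page (an assumption,
    not something the template statistics guarantee).
    Lower bound: every page with a smaller template also carries the largest one.
    Upper bound: no two templates ever share a page.
    """
    counts = list(template_counts.values())
    if not counts:
        return 0, 0
    return max(counts), sum(counts)


# Templates assumed to be mapped to the class Person.
person_templates = {
    "Persondata": 1_190_939,     # figure quoted earlier in this thread
    "Infobox person": 500_000,   # hypothetical placeholder, not a real statistic
}

lower, upper = class_instance_bounds(person_templates)
print(lower, upper)  # 1190939 1690939
```

The plain sum corresponds to the upper end of the range; the real number of Person instances lies somewhere in between, depending on how often the templates share a page.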
Either way, this is not meant to be accurate but to give a usage estimate, to see which properties / classes can be deleted / merged / rearranged.