
SKOS Category extracted produces some weird triples

Open kurzum opened this issue 3 years ago • 2 comments

Issue validity

Some explanation: DBpedia Snapshot is produced every three months, see Release Frequency & Schedule, which is loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and also the DBpedia Information Extraction Framework receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/ we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. Berlin or Joe_Biden here: http://dief.tools.dbpedia.org/server/extraction/en/ If the issue persists, please post the link from your browser here:

http://dief.tools.dbpedia.org/server/extraction/en/extract?title=Category%3APininfarina&revid=&format=trix&extractors=custom
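A link like the one above can also be assembled programmatically. The following is a minimal sketch: the query parameter names (`title`, `revid`, `format`, `extractors`) are copied from the example link rather than from any API documentation, and `build_extraction_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Endpoint taken from the example link in this issue (assumed stable).
EXTRACT_ENDPOINT = "http://dief.tools.dbpedia.org/server/extraction/en/extract"


def build_extraction_url(title: str, fmt: str = "trix", extractors: str = "custom") -> str:
    """Build a link to the extraction web service for a single Wikipedia page.

    Hypothetical helper: parameter names mirror the example link only.
    """
    # urlencode percent-encodes reserved characters, e.g. ":" becomes "%3A".
    query = urlencode({"title": title, "revid": "", "format": fmt, "extractors": extractors})
    return f"{EXTRACT_ENDPOINT}?{query}"


print(build_extraction_url("Category:Pininfarina"))
```

Leaving `revid` empty mirrors the example link, which does not pin a specific revision.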

Error Description

Please state the nature of your technical emergency:

See title.

Pinpointing the source of the error

Where did you find the data issue? Non-exhaustive options are:

  • Web/SPARQL, e.g. http://dbpedia.org/sparql or http://dbpedia.org/resource/Berlin, please provide query or link
  • Dumps: dumps are managed by the Databus. Please provide artifact & version or download link
  • DIEF: you ran the software and the error occurred then, please include all necessary information such as the extractor or log. If you had problems running the software, use another issue template

Should be one of these:

  • https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/ArticleCategoriesExtractor.scala
  • https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/CategoryLabelExtractor.scala
  • https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/SkosCategoriesExtractor.scala

I also assume that the error is caused by these lines on Wikipedia (https://en.wikipedia.org/wiki/Category:Pininfarina):

{{commonscat|Pininfarina}}
{{Cat main|Pininfarina}}

Details

please post the details

Wrong triples RDF snippet

Subject | Predicate | Object
-- | -- | --
http://dbpedia.org/resource/Category:Pininfarina | http://purl.org/dc/terms/subject | http://dbpedia.org/resource/Pininfarina
http://dbpedia.org/resource/Pininfarina | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://www.w3.org/2004/02/skos/core#Concept

Expected / corrected RDF outcome snippet

  1. Remove the triple starting with http://dbpedia.org/resource/Pininfarina. It is easier if all extractors produce only triples with the page as subject.
http://dbpedia.org/resource/Pininfarina | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://www.w3.org/2004/02/skos/core#Concept 
  2. Use a custom property for linking the Category: page to its main article, because dct:subject is definitely the wrong one: it points in the wrong direction and its semantics are underspecified. I created dbo:mainArticleForCategory (http://mappings.dbpedia.org/index.php/OntologyProperty:MainArticleForCategory) for this.
<http://dbpedia.org/resource/Category:Pininfarina> dbo:mainArticleForCategory <http://dbpedia.org/resource/Pininfarina>
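The two corrections can be sketched as a filter over plain (subject, predicate, object) tuples. This is only an illustration of the expected outcome, not the framework's Scala code: `correct_triples` is a hypothetical name, and the sketch assumes that `dbo:` expands to `http://dbpedia.org/ontology/` and that a category page is recognised by `/resource/Category:` in its IRI.

```python
DCT_SUBJECT = "http://purl.org/dc/terms/subject"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS_CONCEPT = "http://www.w3.org/2004/02/skos/core#Concept"
# Assumption: dbo: is the usual DBpedia ontology namespace.
MAIN_ARTICLE = "http://dbpedia.org/ontology/mainArticleForCategory"


def correct_triples(page_iri, triples):
    """Return the triples as the issue expects them (hypothetical helper)."""
    corrected = []
    for s, p, o in triples:
        # Correction 1: keep only triples whose subject is the extracted page.
        if s != page_iri:
            continue
        # Correction 2: on category pages, link to the main article with the
        # custom property instead of dct:subject.
        if p == DCT_SUBJECT and "/resource/Category:" in s:
            p = MAIN_ARTICLE
        corrected.append((s, p, o))
    return corrected


category = "http://dbpedia.org/resource/Category:Pininfarina"
article = "http://dbpedia.org/resource/Pininfarina"
wrong = [
    (category, DCT_SUBJECT, article),
    (article, RDF_TYPE, SKOS_CONCEPT),
]
for triple in correct_triples(category, wrong):
    print(triple)
```

Applied to the two wrong triples above, the sketch drops the `skos:Concept` typing of the article and keeps a single category-to-article link.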

Example DBpedia resource URL(s)


Other

kurzum, Sep 03 '21 06:09

This data error was in the TopicalConceptsExtractor, so I removed the extraction of triples like:

http://dbpedia.org/resource/Pininfarina | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://www.w3.org/2004/02/skos/core#Concept 

and changed dct:subject to dbo:mainArticleForCategory in one part of this extractor.

jlareck, Sep 03 '21 12:09

TODO:

  • [ ] update docu in marvin-config
  • [x] fix test (709-711)
  • [x] link https://github.com/dbpedia/extraction-framework/blob/master/dump/src/test/resources/shacl-tests/instances/category_pininfarina.ttl
  • [x] manual check whether extra triple is produced

kurzum, Nov 18 '21 13:11