extraction-framework
SKOS Category extractor produces some weird triples
Issue validity
Some explanation: A DBpedia snapshot is produced every three months (see Release Frequency & Schedule) and loaded into http://dbpedia.org/sparql . During these three months, Wikipedia changes and the DBpedia Information Extraction Framework (DIEF) also receives patches. At http://dief.tools.dbpedia.org/server/extraction/en/ we host a daily updated extraction web service that can extract one Wikipedia page at a time. To check whether your issue is still valid, please enter the article name, e.g. Berlin or Joe_Biden, here: http://dief.tools.dbpedia.org/server/extraction/en/ . If the issue persists, please post the link from your browser here:
http://dief.tools.dbpedia.org/server/extraction/en/extract?title=Category%3APininfarina&revid=&format=trix&extractors=custom
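A link of this form can be rebuilt for any page title. The following is a minimal sketch, assuming the query parameters (`title`, `revid`, `format`, `extractors`) behave as they appear in the link above; the helper name `extraction_url` is hypothetical and not part of DIEF.

```python
from urllib.parse import urlencode

# Endpoint of the daily updated extraction service, taken from the link above.
BASE = "http://dief.tools.dbpedia.org/server/extraction/en/extract"


def extraction_url(title, fmt="trix", extractors="custom"):
    """Build the single-page extraction link for a Wikipedia page title.

    The parameter names mirror the query string in the issue; an empty
    revid is sent as-is, exactly as in the original link.
    """
    query = urlencode({
        "title": title,          # ':' is percent-encoded as %3A
        "revid": "",
        "format": fmt,
        "extractors": extractors,
    })
    return f"{BASE}?{query}"


print(extraction_url("Category:Pininfarina"))
```

Calling it with `Category:Pininfarina` reproduces the link posted above, which makes it easy to re-check the issue against other category pages.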
Error Description
Please state the nature of your technical emergency:
See title.
Pinpointing the source of the error
Where did you find the data issue? Non-exhaustive options are:
- Web/SPARQL, e.g. http://dbpedia.org/sparql or http://dbpedia.org/resource/Berlin, please provide query or link
- Dumps: dumps are managed by the Databus. Please provide artifact & version or download link
- DIEF: you ran the software and the error occurred then; please include all necessary information such as the extractor or log. If you had problems running the software, use another issue template
Should be one of these:
- https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/ArticleCategoriesExtractor.scala
- https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/CategoryLabelExtractor.scala
- https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/SkosCategoriesExtractor.scala
Also, I assume that the error is caused by these lines on Wikipedia (https://en.wikipedia.org/wiki/Category:Pininfarina):
{{commonscat|Pininfarina}}
{{Cat main|Pininfarina}}
Details
please post the details
Wrong triples RDF snippet
Subject | Predicate | Object
-- | -- | --
http://dbpedia.org/resource/Category:Pininfarina | http://purl.org/dc/terms/subject | http://dbpedia.org/resource/Pininfarina
http://dbpedia.org/resource/Pininfarina | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://www.w3.org/2004/02/skos/core#Concept
Expected / corrected RDF outcome snippet
- Remove the triple starting with http://dbpedia.org/resource/Pininfarina . It is easier if all extractors only produce triples with the page as subject.
  http://dbpedia.org/resource/Pininfarina | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://www.w3.org/2004/02/skos/core#Concept
- Use a custom property for linking a Category: page to its main article, because dct:subject is definitely the wrong one, i.e. wrong direction and underspecified semantics. I created dbo:mainArticleForCategory (http://mappings.dbpedia.org/index.php/OntologyProperty:MainArticleForCategory) for this:
  <http://dbpedia.org/resource/Category:Pininfarina> dbo:mainArticleForCategory <http://dbpedia.org/resource/Pininfarina>
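The two corrections above can be sketched as a small post-processing step over plain (subject, predicate, object) tuples. This is only an illustration of the expected outcome, not the actual extractor code (which is Scala); the function name `correct_triples` is hypothetical, and it assumes that `dbo:` expands to `http://dbpedia.org/ontology/`.

```python
DCT_SUBJECT = "http://purl.org/dc/terms/subject"
# Assumption: dbo: is the usual DBpedia ontology namespace.
DBO_MAIN_ARTICLE = "http://dbpedia.org/ontology/mainArticleForCategory"
CATEGORY_PREFIX = "http://dbpedia.org/resource/Category:"


def correct_triples(page_uri, triples):
    """Apply the two corrections proposed in this issue.

    1. Keep only triples whose subject is the extracted page.
    2. On category pages, replace dct:subject with dbo:mainArticleForCategory.
    """
    fixed = []
    for s, p, o in triples:
        if s != page_uri:
            continue  # drop triples about other resources
        if p == DCT_SUBJECT and s.startswith(CATEGORY_PREFIX):
            p = DBO_MAIN_ARTICLE
        fixed.append((s, p, o))
    return fixed


page = "http://dbpedia.org/resource/Category:Pininfarina"
wrong = [
    (page, DCT_SUBJECT, "http://dbpedia.org/resource/Pininfarina"),
    ("http://dbpedia.org/resource/Pininfarina",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://www.w3.org/2004/02/skos/core#Concept"),
]
for triple in correct_triples(page, wrong):
    print(triple)
```

The rewrite is restricted to subjects under `Category:` so that ordinary article pages, where dct:subject points from the article to its categories, are left untouched.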
Example DBpedia resource URL(s)
Other
This data error was in the TopicalConceptsExtractor, so I removed the extraction of triples like:
http://dbpedia.org/resource/Pininfarina | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://www.w3.org/2004/02/skos/core#Concept
and changed dct:subject to dbo:mainArticleForCategory in one part of this extractor.
TODO:
- [ ] update docu in marvin-config
- [x] fix test (709-711)
- [x] link https://github.com/dbpedia/extraction-framework/blob/master/dump/src/test/resources/shacl-tests/instances/category_pininfarina.ttl
- [x] manual check whether extra triple is produced