extraction-framework
Commons sameAs links: the case of French
Issue validity
As explained here: https://forum.dbpedia.org/t/commons-ressources-extractor-problem/1485 I ran into an issue concerning the Commons links from French Wikipedia pages.
Error Description
Please state the nature of your technical emergency: the data artifact is empty on the Databus: https://databus.dbpedia.org/dbpedia/generic/commons-sameas-links/2021.09.01/commons-sameas-links_lang=fr.ttl.bz2
Pinpointing the source of the error
- In fact, the {{commons}} template is not really used in practice, and the {{Autres projets}} template seems to be preferred over it. Usage statistics show that it is a widely used pattern in French: http://mappings.dbpedia.org/server/statistics/fr/?show=100000
What I have done so far:
- I created a SHACL test (that I haven't pushed yet)
- I fixed the problem locally by overriding the related extractor
My questions:
- The {{Autres projets}} template is language-specific (it is called {{Sister projects}} in English). Is there a way to use a dictionary to make the fix multilingual? Or do I have to add a conditional branch (checking the language and string equivalence)?
- What is the best way to exploit all the sister-project items (Wiktionary, Wikinews, ...)?
Thank you in advance!
- How can I write a SPARQL query to find out whether the French chapter is the only one affected by this problem? Fixing it seems to be sensitive:
Concerning https://databus.dbpedia.org/dbpedia/generic/commons-sameas-links/: these links are only extracted for a handful of languages, and all files seem quite small, so it is possible that French is not the only affected language.
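One quick way to act on this hint, assuming the per-language `commons-sameas-links` files have been downloaded and decompressed, is to count the triples in each file and flag the suspiciously small ones. A minimal sketch (function names, the in-memory `dumps` dict, and the threshold are all hypothetical, just for illustration):

```python
# Sketch: flag languages whose commons-sameas-links artifact contains
# suspiciously few triples. Paths/contents and threshold are hypothetical.

def count_triples(ntriples_text):
    """Count non-empty, non-comment lines in an N-Triples/Turtle dump."""
    return sum(
        1
        for line in ntriples_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )

def suspicious_languages(dumps, threshold=100):
    """dumps: dict mapping a language code to the file content as text."""
    return sorted(
        lang for lang, text in dumps.items()
        if count_triples(text) < threshold
    )

if __name__ == "__main__":
    dumps = {
        "fr": "",  # empty artifact, as observed on the Databus
        "en": "\n".join(
            f"<http://dbpedia.org/resource/R{i}> "
            f"<http://www.w3.org/2002/07/owl#sameAs> "
            f"<http://commons.dbpedia.org/resource/R{i}> ."
            for i in range(500)
        ),
    }
    print(suspicious_languages(dumps))  # -> ['fr']
```

This is only a triage step; a language flagged here still needs a manual check of whether its Wikipedia uses {{commons}} or a sister-projects template.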
- I have the feeling this practice prevents the Commons mapping from being triggered. Doesn't it?
Yes, maybe it is enough to adapt the mappings?
- The {{Autres projets}} template aggregates more than just the "commons" links and could also yield Wiktionary links, Wikiquote links, etc. Should I fix this by adding a language-specific test case to the existing extractor file? Or should I consider developing a new extractor for the Commons links coming from this template? Option 2 seems better, because of the additional data I could grab this way, but I would prefer to get your expertise on that question!
First, you should write a minidump test (you already did; maybe you can open a PR):
- https://github.com/dbpedia/extraction-framework/blob/master/dump/src/test/bash/uris.lst add a wikipage here
- https://github.com/dbpedia/extraction-framework/tree/master/dump/src/test/resources/shacl-tests/classes create a SHACL test here
- Once I have fixed it, what is the best way to push it?
Create a pull request to the dev branch.
The {{Autres projets}} template is language-specific (it is called {{Sister projects}} in English). Is there a way to use a dictionary to make the fix multilingual? Or do I have to add a conditional branch (checking the language and string equivalence)?
IIRC there is no dictionary yet, but it should not be necessary if the extractor uses the mappings correctly.
What is the best way to exploit all the sister-project items (Wiktionary, Wikinews, ...)?
I think downloading the specific wikipage dump and doing a grep is the easiest option: https://dumps.wikimedia.org/
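The grep idea can be sketched in Python as well, streaming the `.bz2` pages-articles dump without fully decompressing it on disk. A minimal sketch (the dump path is a placeholder, and the page-boundary detection is deliberately naive):

```python
import bz2
import re

# Matches the opening of the {{Autres projets}} template call.
TEMPLATE = re.compile(r"\{\{\s*Autres projets", re.IGNORECASE)

def count_template_pages(stream):
    """Count <page> blocks whose wikitext mentions the template.
    `stream` is any iterable of decoded text lines."""
    pages = 0
    in_match = False
    for line in stream:
        if "<page>" in line:
            in_match = False  # new page: reset the flag
        if TEMPLATE.search(line):
            in_match = True
        if "</page>" in line and in_match:
            pages += 1
            in_match = False
    return pages

def count_in_dump(path):
    """Stream a pages-articles .bz2 dump; path is a placeholder."""
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        return count_template_pages(f)

if __name__ == "__main__":
    sample = [
        "<page>", "{{Autres projets|commons=Category:Berlin}}", "</page>",
        "<page>", "no template here", "</page>",
    ]
    print(count_template_pages(sample))  # -> 1
```

For a real check you would run `count_in_dump("frwiki-latest-pages-articles.xml.bz2")` (hypothetical filename) against a dump fetched from https://dumps.wikimedia.org/.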
Hello @Vehnem, and thank you so much for your answers. I took my time to answer because I am still wondering about the infobox extraction process.
In the French chapter, only a small subset of the declared mappings (http://mappings.dbpedia.org/index.php/Mapping_fr) are named with the "Infobox" pattern. In fact, some of them are insert boxes that are not necessarily infoboxes, because they can be placed at the end of the Wikipedia article, like the following template: https://en.wikipedia.org/wiki/Template:Authority_control
However, a few of them, such as the "ChimieBox" (ChemBox in English) or the "Taxobox", are a kind of infobox, even though they don't have "Infobox" in their names.
- Is the "Infobox" naming pattern required (tested, for example, by a regexp) to trigger an extraction through the DIEF? Are the template mappings declared in the XML file used for property data extraction, as in the Authority_control example?
I investigated this question by running the minidump process on some examples of Wikipedia pages that use these templates (https://github.com/datalogism/DBpediaExperiments/blob/main/MappingInfoBoxAnalysis.ipynb).
According to the up-to-date mapping, the ChimieBox is supposed to give us some data: https://github.com/dbpedia/extraction-framework/blob/master/mappings/Mapping_fr.xml. But:
- It returns no information when running the minidump process on this Wikipedia page: https://fr.wikipedia.org/wiki/Abrine
- The test-extraction interface returns mapped extracted information: http://mappings.dbpedia.org/server/extraction/fr/extract?title=Abrine&revid=&format=trix&extractors=mappings
- Is there something in the minidump processing that explains these differences in extraction between "infobox-like templates that do not follow the infobox naming convention" and "infoboxes declared as infobox templates"?
I also noticed a case, https://en.wikipedia.org/wiki/Football_at_the_2012_Summer_Olympics_%E2%80%93_Men's_tournament_%E2%80%93_Final, that uses two templates: "Infobox football match" as an infobox, and "Football box", an included property template describing the football event in more detail.
-> Only the data from the "Infobox football match" template are returned. 3. Why does the second template (the property template) not return data in the minidump process?
-- 4. By re-reading my message before sending it, I have the following intuition: it may be because of the config file managing which extractors the minidump process uses: https://github.com/dbpedia/extraction-framework/blob/master/dump/src/test/resources/extraction-configs/generic-spark.extraction.minidump.properties. It seems to use the same extractors as the "global" extraction config (https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.spark.properties), but is that really the case? Is there an extractor that only extracts the mapped "property templates"?
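One way to answer the "is it really the same extractor list?" question without reading both files by eye is to parse the `extractors...=` lines of the two `.properties` files and diff them. A minimal sketch (the file contents in the demo are fabricated; real configs have many more entries):

```python
def extractor_sets(properties_text):
    """Collect all extractor class names from `extractors...=` lines
    of a DBpedia extraction .properties file."""
    names = set()
    for line in properties_text.splitlines():
        line = line.strip()
        # Matches both the global `extractors=` and per-language
        # `extractors.fr=` style keys.
        if line.startswith("extractors") and "=" in line:
            _, value = line.split("=", 1)
            names.update(n.strip() for n in value.split(",") if n.strip())
    return names

def diff_configs(global_cfg, minidump_cfg):
    """Report extractors that appear in only one of the two configs."""
    g, m = extractor_sets(global_cfg), extractor_sets(minidump_cfg)
    return {"only_global": sorted(g - m), "only_minidump": sorted(m - g)}

if __name__ == "__main__":
    global_cfg = "extractors=.InfoboxExtractor,.MappingExtractor"
    minidump_cfg = "extractors=.InfoboxExtractor"
    print(diff_configs(global_cfg, minidump_cfg))
    # -> {'only_global': ['.MappingExtractor'], 'only_minidump': []}
```

Feeding it the actual contents of `generic-spark.extraction.minidump.properties` and `extraction.spark.properties` would show directly whether the minidump run is missing an extractor.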
Now, to come back to my original question: is it preferable to develop a special property extractor for the {{Sister projects}} template properties? Or is it better to include it in a mapping? And if I do, will it be extracted?
I am sorry for all these questions, but as a newbie, I want to be sure of the process and of how it handles these kinds of data before being able to help the community in the best way.
@datalogism can we just take a step back: what would you actually like to extract, i.e. what kind of triples do you want? If it is, e.g., only about commons sameAs links or "authority template links", I wonder whether it would be best to rely on the Wikidata extraction instead?
https://databus.dbpedia.org/dbpedia/wikidata/sameas-all-wikis/ https://databus.dbpedia.org/dbpedia/wikidata/sameas-external/
@JJ-Author, initially I wanted to get the commons sameAs links, and you are right: for this initial goal your proposed fix is sufficient. But I also understood that I could get all the links carried by the "Sister projects" template; I am thinking about the Wiktionary links, for example.
This road led me to the questions about infobox extraction via the mappings that I exposed above, because at first sight two ways could be possible for extracting them:
- via a dedicated extractor (the initial question asked in this thread)
- or via a mapping (if property objects can be extracted this way; the focus of my second message)
As I understand it, the idea of the mappings extraction is to map infobox parameters to the DBpedia ontology, the idea being that these infoboxes represent more or less standardized information for a subset of entities of the same type. You are right that infoboxes are only templates, so in theory it could work to define an "infobox" mapping for sister projects. But the template seems more like a generic template that is valid for all types of Wikipedia articles (hence I see it more in the generic extraction). So my personal intuition would be that a dedicated extractor is the right choice, because these sister projects are not directly tied to the entity but to the article page itself. If you extracted the triples via a mapping, they would end up in the mappingbased-objects artifact, and I personally think they would not belong there.
With regard to the detailed questions about the minidump, @Vehnem will write to you later.
Thank you @JJ-Author! Your arguments go in the same direction as my first understanding of infoboxes. I wanted to be sure of the design philosophy, because the analysis of the mapping files showed me that properties were mapped, as in the cited Authority control example: http://mappings.dbpedia.org/index.php/Mapping_en:Authority_control.
Question: could this kind of out-of-philosophy mapping affect or alter the type given to an entity?
Looking forward to @Vehnem's feedback!
Now, to come back to my original question: is it preferable to develop a special property extractor for the {{Sister projects}} template properties? Or is it better to include it in a mapping? And if I do, will it be extracted?
@datalogism I think we should first look at existing extractors; maybe some of them have similar logic that we can reuse to achieve what you want. But before checking the extractors, we also need a clear example of what the input and the output should be. So, here are the next things that will help us solve this issue:
- Send a link to some page with the {{Sister projects}} or {{Autres projets}} template.
- For that page, please send the expected extracted triples from the {{Sister projects}} (or {{Autres projets}}) template. Below I will show you an example:
So, for example, we have the page https://en.wikipedia.org/wiki/Borysthenia_goldfussiana, and it contains this infobox:
{{Taxobox
| name = ''Borysthenia goldfussiana''
| image =
| image_caption =
| status =
| regnum = [[Animal]]ia
| phylum = [[Mollusca]]
...
And the InfoboxExtractor (I guess the InfoboxExtractor produced them, but maybe another extractor could also produce those triples) produces triples like these (let's also assume they are the expected extracted triples):
<http://dbpedia.org/resource/Borysthenia_goldfussiana> <http://dbpedia.org/property/name> "Borysthenia goldfussiana"@en .
<http://dbpedia.org/resource/Borysthenia_goldfussiana> <http://dbpedia.org/property/regnum> "Animalia"@en .
<http://dbpedia.org/resource/Borysthenia_goldfussiana> <http://dbpedia.org/property/phylum> <http://dbpedia.org/resource/Mollusca> .
So, in a similar way, please describe what data must be produced from some concrete page. It would be very helpful to know what should be the subject, the predicate, and the object. Thank you.
Hello @jlareck !
Concerning the Sister projects template question: almost every French article has one. We miss the "commons sameAs" triples because, as explained, the current extractor relies on the use of the {{commons}} template, which is never used alone in French Wikipedia.
Let's take this example: https://fr.wikipedia.org/wiki/Berlin contains the following template at the end of the article:
{{Autres projets
| commons=Category:Berlin
| wiktionary=Berlin
| wikinews=Catégorie:Berlin
| wikivoyage=Berlin
}}
In terms of triples, we could imagine something like the following, using the owl:sameAs property. But we could also imagine creating a special property in the ontology to describe it (following the example of the WikiPageInterLanguageLink property, we could have a property called WiktionaryLink):
<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <http://commons.dbpedia.org/resource/Berlin> .
<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <https://fr.wiktionary.org/wiki/Berlin> .
<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <https://fr.wikinews.org/wiki/Cat%C3%A9gorie:Berlin> .
<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <https://fr.wikivoyage.org/wiki/Berlin> .
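The parameter-to-URI mapping behind the triples above can be sketched in Python. This is only an illustration of the idea, not the framework's actual behaviour: the project-to-base-URL table, the URI-encoding scheme, and the function names are all assumptions.

```python
import re
from urllib.parse import quote

# Hypothetical table mapping {{Autres projets}} parameter names to
# target namespaces (illustrative; real namespaces may differ per wiki).
PROJECT_BASES = {
    "commons": "http://commons.dbpedia.org/resource/",
    "wiktionary": "https://fr.wiktionary.org/wiki/",
    "wikinews": "https://fr.wikinews.org/wiki/",
    "wikivoyage": "https://fr.wikivoyage.org/wiki/",
}
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def parse_params(wikitext):
    """Extract key=value pairs from an {{Autres projets}} call
    (naive: does not handle nested templates)."""
    body = re.search(r"\{\{\s*Autres projets(.*?)\}\}", wikitext, re.S)
    params = {}
    if body:
        for part in body.group(1).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                params[key.strip()] = value.strip()
    return params

def to_triples(subject, wikitext):
    """Emit one owl:sameAs triple per recognized sister-project link."""
    triples = []
    for key, value in parse_params(wikitext).items():
        base = PROJECT_BASES.get(key)
        if base and value:
            target = base + quote(value.replace(" ", "_"), safe=":/")
            triples.append(f"<{subject}> <{SAME_AS}> <{target}> .")
    return triples
```

For the Berlin example, `to_triples("http://fr.dbpedia.org/resource/Berlin", "{{Autres projets|commons=Category:Berlin|wikinews=Catégorie:Berlin}}")` would yield the commons and (percent-encoded) wikinews triples shown above.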
For the moment, only an extractor for commons exists: https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/CommonsResourceExtractor.scala
If we plan to integrate the Wiktionary links and the other wikis, we have to think about how to shape it:
- creating a global extractor able to deal with the {{Sister projects}} template
- creating one extractor for each of the other wiki portals.
@datalogism okay, so for {{Autres projets}} we can create a new extractor, and the InfoboxExtractor can be the starting point for creating it. It already extracts the data from this template and produces, for example, triples like these:
<http://fr.dbpedia.org/resource/Antoine_Meillet> <http://fr.dbpedia.org/property/wikisource> "Antoine Meillet"@fr .
<http://fr.dbpedia.org/resource/Antoine_Meillet> <http://fr.dbpedia.org/property/commons> "Category:Antoine Meillet"@fr .
from
{{Autres projets
|wikisource = Antoine Meillet
|commons = Category:Antoine Meillet
}}
So, we can take the InfoboxExtractor as a base, modify some parts, and produce the necessary triples from this template.
This is an example of the {{Sister project links}} template:
{{Sister project links|Angela Merkel|wikt=Merkozy|s=Author:Angela Merkel|display=Angela Merkel}}
As I see it, {{Sister project links}} has a different structure, and we need to think more about how to handle it. Here I guess we need to map properties like s and wikt to some other properties. And it looks like we need to skip some parts of this template during the extraction (e.g. |Angela Merkel| and display=Angela Merkel), am I right?
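The skip-these-parts idea can be sketched as follows: keep only the named key=value parameters and drop the positional ones plus presentation-only keys. This is a hypothetical helper for discussion, not framework code; the `skip` list is read off the example above.

```python
def named_link_params(template_body, skip=("display",)):
    """Keep only named key=value parameters of a template call,
    dropping positional parameters and presentation-only keys."""
    params = {}
    for part in template_body.split("|")[1:]:  # [0] is the template name
        part = part.strip().rstrip("}")
        if "=" not in part:
            continue  # positional parameter, e.g. |Angela Merkel|
        key, value = part.split("=", 1)
        key = key.strip()
        if key not in skip:
            params[key] = value.strip()
    return params

if __name__ == "__main__":
    call = "{{Sister project links|Angela Merkel|wikt=Merkozy|s=Author:Angela Merkel|display=Angela Merkel}}"
    print(named_link_params(call.strip("{}")))
    # -> {'wikt': 'Merkozy', 's': 'Author:Angela Merkel'}
```

The surviving keys (wikt, s, ...) would then still need the abbreviation-to-project mapping discussed above (wikt -> Wiktionary, s -> Wikisource, and so on).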
Hi @datalogism, I have implemented a draft extractor for {{Autres projets}}. You can have a look at it; maybe something in it can be helpful for you: https://github.com/dbpedia/extraction-framework/blob/a9ed5f0396c82854c8e1663d87571a0935c444ab/core/src/main/scala/org/dbpedia/extraction/mappings/AutresProjectExtractor.scala. It produces triples like:
<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <http://fr.commons.dbpedia.org/resource/Category:Berlin> .
<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <https://fr.wiktionary.org/wiki/Berlin> .
<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <https://fr.wikinews.org/wiki/Catégorie:Berlin> .
<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <https://fr.wikivoyage.org/wiki/Berlin> .
You can execute the minidump tests and see those triples in the infobox-properties dataset (I reused the dataset configuration from the InfoboxExtractor for this draft implementation of the AutresProjetExtractor).
As I see it, {{Sister project links}} has a different structure, and we need to think more about how to handle it. Here I guess we need to map properties like s and wikt to some other properties.
I hadn't thought about this template; you got it. This one is based on a Lua script defined here: https://en.wikipedia.org/wiki/Module:Sister_project_links. Based on this, we could easily adapt it in an extractor.
For me, this script highlights two kinds of links: the ones we can easily find via a search (generally via the name of the article), and the ones that are not obvious. "Merkozy" is a good example here, and it gives the extraction real added value.
I have implemented a draft extractor for {{Autres projets}}. You can have a look at it, maybe something can be helpful for you
Thank you again, @JJ-Author, @Vehnem and @jlareck for your help and support! I will test this brand-new extractor in the next few days and give you my feedback!