dataprop extractor: language doesn't handle lang tag sr-Cyrl
template: http://mappings.dbpedia.org/index.php/Template:PropertyMapping says:
- language: if the datatype is of rdf:langString we can define the language of the language tag using the wikipedia language code (e.g. language = de). (Here "datatype" is incorrect; it should say "range".)
property: http://mappings.dbpedia.org/index.php/OntologyProperty:Foaf:name
- does have rdfs:range rdf:langString
mapping: http://mappings.dbpedia.org/index.php?title=Mapping_fr:Infobox_Ville_de_Serbie&action=edit has
{{PropertyMapping | templateProperty = nom | ontologyProperty = foaf:name | language = fr }}
{{PropertyMapping | templateProperty = nom_cyrillique | ontologyProperty = foaf:name | language = sr-Cyrl }}
wiki page: https://fr.wikipedia.org/w/index.php?title=Požega_(Serbie)&action=edit has
| nom_cyrillique = Пожега
result: http://mappings.dbpedia.org/server/extraction/fr/extract?title=Požega_(Serbie)&revid=&format=turtle-triples&extractors=custom
- has foaf:name "Požega"@fr
- doesn't have foaf:name "Пожега"@sr-Cyrl
Maybe the dataprop extractor has the wrong idea of what a lang tag can be? sr-Cyrl is a valid lang tag meaning "lang=Serbian, script=Cyrillic".
This is critical, because we want to change 10-15 lang-specific props to foaf:name with a lang tag: http://mappings.dbpedia.org/index.php/What%27s_in_a_Name#Language-specific_Names
Another interesting lang tag is "qqq-DZ" (meaning "language used in specific region: Algeria") in http://mappings.dbpedia.org/index.php?title=Mapping_fr:Infobox_Commune_d'Algérie&action=edit
I now see http://mappings.dbpedia.org/index.php/Template:PropertyMapping says: "we can define the language tag using the wikipedia language code".
But you should accept IANA lang tags, not wikipedia codes, since the language of a wikipedia does not limit the lang strings that it can contain. E.g. frwiki talks about names in Serbian Cyrillic (sr-Cyrl), Gagauz (gag), Algerian (which is not a single lang, ergo qqq-DZ), etc.
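To make the point about tag shape concrete, here is a minimal Python sketch (not the framework's Scala code) that checks whether a tag is well-formed under a simplified subset of the BCP 47 grammar: language, optional script, optional region. The names `LangTag` and `parse_lang_tag` are hypothetical, and the check is purely syntactic: it does not consult the IANA subtag registry, and it ignores variants, extensions and private-use tags.

```python
import re
from typing import NamedTuple, Optional

# Simplified subset of the BCP 47 (RFC 5646) grammar: language[-script][-region].
# Assumption: variants, extensions, private-use and grandfathered tags are out of scope.
_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


class LangTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_lang_tag(tag: str) -> Optional[LangTag]:
    """Return the subtags if `tag` is well-formed under the subset above, else None."""
    m = _TAG.fullmatch(tag)
    if m is None:
        return None
    script = m.group("script")
    region = m.group("region")
    # Conventional casing: lowercase language, Titlecase script, UPPERCASE region.
    return LangTag(
        m.group("language").lower(),
        script.title() if script else None,
        region.upper() if region else None,
    )
```

Under this sketch, `parse_lang_tag("sr-Cyrl")` yields language `sr` with script `Cyrl`, `parse_lang_tag("qqq-DZ")` yields language `qqq` with region `DZ`, and a plain wiki code such as `fr` is still accepted, so allowing such tags would be a superset of the current behaviour.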
This is a nice addition, but I'm not sure what it might break in the framework. @jcsahnwaldt any ideas? There are some comments in the file [1], probably by you.
[1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/util/Language.scala
@jimregan: at first glance, we need to add to nonIsoCodes at https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/util/Language.scala#L100 each of the language codes we dealt with at https://github.com/dbpedia/mappings-tracker/issues/15
But I'm not sure what these codes are used for:
- it makes sense to have a default lang code for each wiki, in order to mark rdf:langStrings with that code when PropertyMapping isn't given a "language"
- but why do we need a complete list of all languages that are used on all wikis? We'd have to update it every time we discover some weird dialect is used in a particular templateProperty
Ok, well that mapping needs to go. And never be mentioned again!
There are at least two problems with the current system:
- you need to manually add each and every language to the "non ISO" map, in order to avoid throwing an exception
- the map entry is then leveraged to turn the exact language into a "broader" one
So, even if you set language="gag" in the mapping, it will end up in the triples with an @ tr language tag, which may not be what you expected...
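The two problems can be illustrated with a small Python sketch. This is a hypothetical model of the behaviour described in this thread, not the code in Language.scala: the names `NON_ISO_TO_BROADER`, `KNOWN_WIKI_CODES`, `resolve_current` and `resolve_verbatim` are made up for illustration, and only the gag-to-tr entry comes from the discussion above.

```python
# Hypothetical stand-in for the "non ISO" map: exact code -> broader code.
# Assumption: only the gag -> tr entry is taken from the discussion above.
NON_ISO_TO_BROADER = {"gag": "tr"}

# Hypothetical set of codes that are accepted as-is (e.g. wiki language codes).
KNOWN_WIKI_CODES = {"fr", "de", "tr"}


def resolve_current(code: str) -> str:
    """Model of the described behaviour: listed codes are broadened, unknown codes raise."""
    if code in NON_ISO_TO_BROADER:
        return NON_ISO_TO_BROADER[code]  # problem 2: the exact language is lost
    if code in KNOWN_WIKI_CODES:
        return code
    raise ValueError(f"unknown language code: {code}")  # problem 1: needs a manual entry


def resolve_verbatim(code: str) -> str:
    """Possible alternative: emit the tag exactly as written in the mapping."""
    return code
```

In this model, `resolve_current("gag")` returns `tr`, and `resolve_current("sr-Cyrl")` raises until someone adds an entry by hand, whereas `resolve_verbatim` keeps both tags unchanged. Whether passing tags through verbatim is safe for the rest of the framework is exactly the open question raised above.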