dataprop extractor: language doesn't handle lang tag sr-Cyrl

Open VladimirAlexiev opened this issue 10 years ago • 7 comments

template: http://mappings.dbpedia.org/index.php/Template:PropertyMapping says:

  • language: if the datatype is of rdf:langString we can define the language of the language tag using the wikipedia language code (e.g. language = de)
  • note: "datatype" is incorrect in that text; it should say "range"

property: http://mappings.dbpedia.org/index.php/OntologyProperty:Foaf:name

  • does have rdfs:range rdf:langString

mapping: http://mappings.dbpedia.org/index.php?title=Mapping_fr:Infobox_Ville_de_Serbie&action=edit has

{{PropertyMapping | templateProperty = nom | ontologyProperty = foaf:name | language = fr }}
{{PropertyMapping | templateProperty = nom_cyrillique | ontologyProperty = foaf:name | language = sr-Cyrl }}

wiki page: https://fr.wikipedia.org/w/index.php?title=Požega_(Serbie)&action=edit has

| nom_cyrillique           = Пожега

result: http://mappings.dbpedia.org/server/extraction/fr/extract?title=Požega_(Serbie)&revid=&format=turtle-triples&extractors=custom

  • has foaf:name "Požega"@fr
  • doesn't have foaf:name "Пожега"@sr-Cyrl

Maybe the dataprop extractor has the wrong idea of what a lang tag can be? The tag above is a valid lang tag meaning "lang=Serbian, script=Cyrillic".
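For reference, a minimal sketch of how such a tag decomposes, assuming only the language-script-region subset of the lang tag syntax (variants, extensions and private-use subtags are ignored). The name `parse_lang_tag` is hypothetical and not part of the extraction framework, and the check covers well-formedness only, not lookup in the IANA subtag registry.

```python
import re

# Simplified tag shape: language[-Script][-REGION].
# Assumption: variants, extensions and private-use subtags are out of scope.
LANG_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)

def parse_lang_tag(tag):
    """Return (language, script, region) for a well-formed tag, else None."""
    match = LANG_TAG.fullmatch(tag)
    if match is None:
        return None
    script = match.group("script")
    region = match.group("region")
    return (
        match.group("language").lower(),      # language subtag, lowercased
        script.title() if script else None,   # script subtag, e.g. "Cyrl"
        region.upper() if region else None,   # region subtag, e.g. "DZ"
    )

print(parse_lang_tag("fr"))
print(parse_lang_tag("sr-Cyrl"))
```

Under these assumptions `sr-Cyrl` splits into language `sr` and script `Cyrl`, so an extractor that only accepts bare two- or three-letter codes would reject it.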

VladimirAlexiev avatar Jan 13 '15 07:01 VladimirAlexiev

This is critical, because we want to fix 10-15 lang-specific props to foaf:name with lang tag: http://mappings.dbpedia.org/index.php/What%27s_in_a_Name#Language-specific_Names

VladimirAlexiev avatar Jan 13 '15 07:01 VladimirAlexiev

Another interesting lang tag is "qqq-DZ" (meaning "language used in specific region: Algeria") in http://mappings.dbpedia.org/index.php?title=Mapping_fr:Infobox_Commune_d'Algérie&action=edit

VladimirAlexiev avatar Feb 15 '15 14:02 VladimirAlexiev

I now see http://mappings.dbpedia.org/index.php/Template:PropertyMapping says: "we can define the language tag using the wikipedia language code".

But the extractor should accept IANA lang tags, not just wikipedia codes, since the language of a wikipedia does not limit the lang strings that it can contain. E.g. frwiki talks about names in Serbian Cyrillic (sr-Cyrl), Gagauz (gag), Algerian (which is not a single lang, ergo qqq-DZ) etc.

VladimirAlexiev avatar Feb 15 '15 15:02 VladimirAlexiev

This is a nice addition, but I'm not sure what it might break in the framework. @jcsahnwaldt any ideas? There are some comments in the file [1], probably by you.

[1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/util/Language.scala

jimkont avatar Feb 16 '15 07:02 jimkont

@jimregan: at first glance, we need to add to nonIsoCodes at https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/util/Language.scala#L100 each of the language codes we dealt with at https://github.com/dbpedia/mappings-tracker/issues/15

But I'm not sure what these codes are used for:

  • it makes sense to have a default lang code for each wiki, in order to mark rdf:langStrings with that code when PropertyMapping isn't given a "language"
  • but why do we need a complete list of all languages that are used on all wikis? We'd have to update it every time we discover some weird dialect is used in a particular templateProperty
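The first point can be sketched as a simple fallback: use the mapping's "language" when it is given, otherwise the default code of the wiki being extracted. This is an illustrative sketch, not the framework's code; `resolve_lang_tag` and `to_turtle_literal` are hypothetical names.

```python
def resolve_lang_tag(wiki_code, mapping_language=None):
    """Pick the tag for an rdf:langString literal.

    Hypothetical helper: an explicit mapping language wins;
    otherwise fall back to the wiki's default code.
    """
    if mapping_language and mapping_language.strip():
        return mapping_language.strip()
    return wiki_code

def to_turtle_literal(value, lang_tag):
    """Render a language-tagged literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang_tag}'

# No "language" in the mapping: the frwiki default applies.
print(to_turtle_literal("Požega", resolve_lang_tag("fr")))
# Explicit "language = sr-Cyrl" in the mapping: passed through unchanged.
print(to_turtle_literal("Пожега", resolve_lang_tag("fr", "sr-Cyrl")))
```

With this shape nothing needs a complete list of languages: the per-wiki default is the only code the framework has to know in advance.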

VladimirAlexiev avatar Feb 16 '15 10:02 VladimirAlexiev

Ok, well that mapping needs to go. And never be mentioned again!

jimregan avatar Feb 16 '15 12:02 jimregan

There are at least two problems with the current system:

  • you need to manually add each and every language to the "non ISO" map, in order to avoid throwing an exception
  • the map entry is then leveraged to turn the exact language into a "broader" one

So, even if you set language="gag" in the mapping, it will end up in the triples with an @tr language tag, which may not be what you expected...
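A toy model of the two problems, not the actual Language.scala code: the names `NON_ISO_CODES`, `KNOWN_ISO_CODES`, `current_style_tag` and `preserving_tag` are hypothetical, and the only map entry is the gag → tr example above.

```python
# Toy stand-in for the "non ISO" map: exact code -> broader code.
# Assumption: only the gag -> tr example from this comment is modelled.
NON_ISO_CODES = {"gag": "tr"}
KNOWN_ISO_CODES = {"fr", "sr", "tr"}  # illustrative subset, not the real list

def current_style_tag(code):
    """Mimic both problems: unknown codes raise, mapped codes are broadened."""
    if code in KNOWN_ISO_CODES:
        return code
    if code in NON_ISO_CODES:
        return NON_ISO_CODES[code]  # the exact language is lost here
    raise ValueError(f"unknown language code: {code}")

def preserving_tag(code):
    """Possible alternative: keep the tag exactly as written in the mapping."""
    return code

print(current_style_tag("gag"))
print(preserving_tag("gag"))
```

In this model a code that is in neither table, such as sr-Cyrl, raises instead of producing a triple.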

Nono314 avatar Mar 01 '15 17:03 Nono314