extraction-framework
removing parasitic prefix/suffix from raw props
Numbered raw props are collapsed to one prop:
http://fr.wikipedia.org/w/index.php?title=Antioche&action=edit:
| division = [[Région méditerranéenne]]
| nom de division = [[Régions de Turquie|Région]]
| division2 = [[Hatay]]
| nom de division2 = [[Provinces de Turquie|Province]]
| division3 = [[Région méditerranéenne]]
| nom de division3 = [[Districts de Turquie|District]]
Results in this on fr.dbpedia.org:
http://fr.dbpedia.org/property/nomDeDivision http://fr.dbpedia.org/resource/Provinces_de_Turquie
http://fr.dbpedia.org/property/nomDeDivision http://fr.dbpedia.org/resource/Régions_de_Turquie
http://fr.dbpedia.org/property/nomDeDivision http://fr.dbpedia.org/resource/Districts_de_Turquie
All three "nom de divisionN" are mapped to the same nomDeDivision.
This is the default behavior to avoid multiple property definitions. Here is the code that cleans the URIs: https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala#L278-290
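A rough sketch of the collapsing effect, in Python rather than the framework's Scala. It assumes the cleanup amounts to stripping a trailing run of digits and camel-casing the remaining words. The linked `InfoboxExtractor` code does more than this, and `raw_property_key` is a hypothetical name used only for illustration.

```python
import re


def raw_property_key(template_prop: str) -> str:
    """Approximate the raw property local name for a template property.

    Assumption: only a trailing run of digits is dropped, then the words
    (separated by spaces or underscores) are joined in camelCase.
    """
    # Drop the numeric suffix, e.g. "nom de division2" -> "nom de division"
    without_suffix = re.sub(r"[0-9]+$", "", template_prop.strip())
    words = [w for w in re.split(r"[\s_]+", without_suffix) if w]
    if not words:
        return ""
    # Keep the first word as is, capitalize the first letter of the rest
    return words[0] + "".join(w[:1].upper() + w[1:] for w in words[1:])


names = ["nom de division", "nom de division2", "nom de division3"]
print({raw_property_key(n) for n in names})  # a single key: nomDeDivision
```

Under these assumptions all three template properties end up with the same local name, which is why the three values above share one property.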
But they have different meanings. Similarly numbered props go in groups, so they should go into different IntermediateNodeMappings. If you collapse the props by wiping the numbers, won't such distinctions be lost?
Politician templates are the worst, eg see http://mappings.dbpedia.org/index.php?title=Mapping_bg:Държавник_инфо&action=edit.
You have a different prop group for every nested position>mandate (eg 3*5=15), and the grouping is both by prop prefix and suffix, and it's not consistent.
Eg there are 10 props "предшестван", all mapped to "predecessor" but in different groups:
предшестван от
предшестван от2
предшестван от3
втори_мандат_предшестван от
втори_мандат_предшестван от2
втори_мандат_предшестван от3
трети_мандат_предшестван от
...
Which of them do you collapse, which do you keep, and why in each case?
@jimkont Please reopen for further investigation
This does not affect the mappings extractor & intermediate nodes, only the raw infobox extractor data. In some cases it might make sense to keep the numbers but there are also many where it does not. I don't have an example handy but last time I checked (2-3 years ago) there were quite a few.
So I am reopening, and I suggest we handle this like the mappings case: examine the diffs from both options.
So when you write templateProperty=x, that may not really be dbprop:x but a modification thereof?
I agree with these modifications: since the semantics of the "parasitic" prefixes and suffixes of raw props are not transmitted clearly, it's better to transmit them all as the "root" of the prop name.
Oh this needs to be documented in a whole chapter... my head hurts.
Two questions about the modification:
- It drops suffix \d+
- Shouldn't it also drop suffix \d+.* ?
- Shouldn't there be a configurable list of droppable prefixes? Eg (първи|втори|трети) мандат
There's a misunderstanding...
This issue only affects properties in the http://dbpedia.org/property/ namespace, produced by InfoboxExtractor.
The mappings produce properties in the http://dbpedia.org/ontology/ namespace. See MappingExtractor. Completely different code.
When you write templateProperty=x, you get exactly dbo:x. No modification. You never get dbprop:x - that's a completely different namespace.
Hope that clears things up. :-)
http://wiki.dbpedia.org/Downloads2014#mapping-based-properties
http://wiki.dbpedia.org/Downloads2014#raw-infobox-properties
http://wiki.dbpedia.org/Datasets#h434-10
P.S.: "When you write templateProperty=x, you get exactly dbo:x. No modification." - I think that's correct, but I'm not 100% sure, e.g. upper/lower case. Would have to check the code.
More precisely: You won't get dbo:x. When you write templateProperty=x in a mapping, it matches exactly x in the Wikitext source, not x2 or anything else (check the code of SimplePropertyMapping.scala for details). Of course, you also specify ontologyProperty=y in the mapping, so you will get http://dbpedia.org/ontology/y, which is sometimes abbreviated as dbo:y.
I think we can close this issue. It only affects the raw Infobox properties, which is basically a legacy dataset. In the last few years, we haven't put much work into it, and for good reason - the DBpedia wiki strongly recommends using the mapping based properties. They are much better.
I think you misunderstood most of what I wrote above (not your brilliant self today ;-)). I grok that when I write in an IntermediateNodeMapping:
{{ PropertyMapping | templateProperty = втори_мандат_предшестван от3 | ontologyProperty = predecessor }}
It generates
bgdbr:Тодор_Живков bgdbp:вториМандатПредшестванОт <pred>. # raw: bgdbp: and dropped suffix
bgdbr:Тодор_Живков__1 dbo:predecessor <pred>. # mapped: dbo: and in IntermediateNode
BUT I'm pleading it should generate
bgdbr:Тодор_Живков bgdbp:предшестванОт <pred>.
bgdbr:Тодор_Живков__1 dbo:predecessor <pred>.
because the prefix втори_мандат_ is just as parasitic as the numeric suffix.
And also: the parasitic numeric-alphabetic suffix of предшестван от3a should also be dropped.
Documented at http://mappings.dbpedia.org/index.php/Rewriting_templateProperty. @jcsahnwaldt could you please take a look and see if it's accurate?
Don't throw away the raw props! They're there even if there are no mappings, or the mappings are wrong (alas, they are often wrong, even in big dbpedias like fr). So there are many real-world queries that mix raw and mapped props.
Vladimir, please do not use the custom extractor when testing: it produces data from all available extractors (even the infobox extractor). Use the default option, which extracts only labels and mappings (let me know if this is not the case).
@VladimirAlexiev I checked out http://mappings.dbpedia.org/index.php/Rewriting_templateProperty . Nice page! But it's largely... how do I say it nicely... well, it's just wrong. Now I know where this misunderstanding is coming from, and why you dared question my authority. ;-) I added these lines to the page:
Here's what actually happens:
- Wikitext is parsed into an AST (abstract syntax tree)
- The AST is passed to several different extractors (according to configuration)
- Each extractor processes the AST and produces triples
- The triples are not used as input for any other extractor
Here's what the InfoboxExtractor does:
- data is extracted from template props in the AST
- these are emitted as language-specific '''raw''' props, eg
- http://dbpedia.org/property/parent for EN (usual prefix [http://prefix.cc/dbp dbp:])
- http://bg.dbpedia.org/property/родител for BG (usual prefix [http://prefix.cc/bgdbp bgdbp:])
Here's what the MappingExtractor does:
- data is extracted from template props in the AST and passed through mappings templateProperty -> ontologyProperty
- these are emitted as generic mapping-based props, eg
- http://dbpedia.org/ontology/parent for EN, BG and any other language (usual prefix dbo:)
In other words:
The InfoboxExtractor doesn't care about the mappings at all, processes all named properties and generates
bgdbr:Тодор_Живков bgdbp:вториМандатПредшестванОт <pred>. # raw: bgdbp: and dropped suffix
The MappingExtractor doesn't care at all about the property names produced by InfoboxExtractor, only extracts properties for which a mapping exists and generates
bgdbr:Тодор_Живков__1 dbo:predecessor <pred>. # mapped: dbo: and in IntermediateNode
The two are completely independent. (Well, they both process the AST produced by the Wikitext parser, but that's it.)
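The independence of the two extractors can be illustrated with a toy model in Python. This is only a sketch under loud assumptions: a plain dict stands in for the template node of the AST, `raw_key` is a simplified guess at the name cleanup, intermediate nodes are left out, and none of the function names come from the framework.

```python
import re
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def raw_key(name: str) -> str:
    # Simplified assumption: drop trailing digits, then camelCase the words
    words = [w for w in re.split(r"[\s_]+", re.sub(r"[0-9]+$", "", name)) if w]
    if not words:
        return ""
    return words[0] + "".join(w[:1].upper() + w[1:] for w in words[1:])


def infobox_extract(subject: str, props: Dict[str, str]) -> List[Triple]:
    # Raw extractor: emits every named property, never looks at mappings
    return [(subject, "bgdbp:" + raw_key(n), v) for n, v in props.items()]


def mapping_extract(
    subject: str, props: Dict[str, str], mappings: Dict[str, str]
) -> List[Triple]:
    # Mapping extractor: exact match on templateProperty, never sees raw keys
    return [
        (subject, "dbo:" + mappings[n], v)
        for n, v in props.items()
        if n in mappings
    ]


props = {"втори_мандат_предшестван от3": "<pred>", "бележки": "<note>"}
mappings = {"втори_мандат_предшестван от3": "predecessor"}

print(infobox_extract("bgdbr:Тодор_Живков", props))
print(mapping_extract("bgdbr:Тодор_Живков", props, mappings))
```

Both functions read the same `props` and neither consumes the other's output. A mapping keyed on `втори_мандат_предшестван от` (without the `3`) would match nothing, because the match is exact.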
I hope that clears up things. This issue has nothing to do with mappings.
But you raised a few good questions about the InfoboxExtractor:
- It drops suffix \d+
That's correct. It also does a few more things. See InfoboxExtractor.getPropertyUri for details.
- Shouldn't it also drop suffix \d+.* ?
Probably not. I think there are some properties that contain a digit somewhere in the middle of their name. Something more specific would be better.
- Shouldn't there be a configurable list of droppable prefixes? Eg (първи|втори|трети) мандат
Sounds good! Some config values in the class InfoboxExtractorConfig are already language specific. Might be relatively easy to add a few more such configuration values and use them in InfoboxExtractor.getPropertyUri.
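One way such a language-specific prefix list might look, sketched in Python. Everything here is hypothetical: `DROPPABLE_PREFIXES` and `strip_droppable_prefix` do not exist in `InfoboxExtractorConfig` or `InfoboxExtractor`, and the sketch assumes that spaces and underscores in property names are interchangeable.

```python
import re
from typing import Dict, List

# Hypothetical per-language configuration of "parasitic" prefixes
DROPPABLE_PREFIXES: Dict[str, List[str]] = {
    "bg": ["първи мандат", "втори мандат", "трети мандат"],
}


def strip_droppable_prefix(name: str, lang: str) -> str:
    """Remove one configured prefix from a template property name, if present."""
    # Assumption: treat runs of spaces/underscores as a single space
    normalized = re.sub(r"[\s_]+", " ", name).strip()
    for prefix in DROPPABLE_PREFIXES.get(lang, []):
        if normalized.startswith(prefix + " "):
            return normalized[len(prefix) + 1:]
    return normalized


print(strip_droppable_prefix("втори_мандат_предшестван от3", "bg"))
# prints: предшестван от3
```

The numeric suffix is left alone here on purpose, so that prefix and suffix handling stay separate steps. Languages without an entry in the table are only normalized, not stripped.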
@jcsahnwaldt Thanks for the edits! I'll fix up that page. @jimkont "please do not use the custom extractor when testing, this produces data from all available extractors (even infobox extractor)": is there any harm in that?
Yes, there is a limit on the extraction samples, so for big articles you might not get all the expected triples, as with the Elvis Presley link you posted on the mailing lists. So if you want to test the mappings, use only the default extractor; if you want to see what else DBpedia would produce, use the custom one, but beware that the output might not be complete.
Added your warning to http://mappings.dbpedia.org/index.php/Main_Page#Custom_or_Default_Extractor.
Now back to the topic: based on https://github.com/dbpedia/mappings-tracker/issues/51, I suggest also removing suffixes that are 1-2 digits followed by a single letter, i.e. match this:
[^0-9][0-9]{1,2}[:alpha:]$
I don't think that's what you want: it would also remove the character before the digits. Are you looking for a negative look-behind instead? I.e. (?<![0-9])[0-9]{1,2}[a-z]?$
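The difference between the two patterns can be checked with Python's `re` module. This is a sketch, not framework code: the POSIX class `[:alpha:]` is approximated here as `[A-Za-z]`, which is an assumption about the intent, and the sample names are illustrative.

```python
import re

# First proposal: the leading [^0-9] is part of the match, so it gets removed too
CONSUMING = re.compile(r"[^0-9][0-9]{1,2}[A-Za-z]$")

# Look-behind variant: asserts "no digit before" without consuming anything
LOOKBEHIND = re.compile(r"(?<![0-9])[0-9]{1,2}[a-z]?$")

name = "предшестван от3a"
print(CONSUMING.sub("", name))   # prints: предшестван о   (the "т" is lost)
print(LOOKBEHIND.sub("", name))  # prints: предшестван от

# The optional letter means plain numeric suffixes are covered as well
print(LOOKBEHIND.sub("", "предшестван от3"))  # prints: предшестван от

# Longer digit runs are left untouched by the look-behind variant
print(LOOKBEHIND.sub("", "area1234"))  # prints: area1234
```

Note that `[a-z]` only covers Latin lowercase letters, so a Cyrillic letter after the digits would not be stripped by this variant.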
Anyway, I have always thought that the issue with "raw" props is more on the value side, since they're actually anything but raw. This just leads to newcomers seeing bugs everywhere just as in https://github.com/dbpedia/extraction-framework/issues/317. The problem is, throwing parsers blindly (as opposed to selecting the parser based on the ontology property type) at a prop does not always yield something meaningful.
That's why, when preparing for mapping, they may only give a shallow hint at the real template property content. And when trying to fix parser bugs, they are often of very little help...
Right about the regex. Did you write that guide? I think it's excellent; I added a bit and linked it to other pages I wrote. We need more best practices on specific topics (eg mapping Place Relations or Dimensions)