extraction-framework removing parasitic prefix/suffix from raw props

Numbered raw props are collapsed to one prop:

http://fr.wikipedia.org/w/index.php?title=Antioche&action=edit:

 | division                 = [[Région méditerranéenne]]
 | nom de division          = [[Régions de Turquie|Région]]
 | division2                = [[Hatay]]
 | nom de division2         = [[Provinces de Turquie|Province]]
 | division3                = [[Région méditerranéenne]]
 | nom de division3         = [[Districts de Turquie|District]]

Results in this on fr.dbpedia.org:

http://fr.dbpedia.org/property/nomDeDivision    http://fr.dbpedia.org/resource/Provinces_de_Turquie
http://fr.dbpedia.org/property/nomDeDivision    http://fr.dbpedia.org/resource/Régions_de_Turquie
http://fr.dbpedia.org/property/nomDeDivision    http://fr.dbpedia.org/resource/Districts_de_Turquie

All three "nom de divisionN" are mapped to the same nomDeDivision.
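
A minimal sketch of why the three keys end up on one property, assuming the raw extractor strips a trailing run of digits and then camel-cases the remaining words. The function name `raw_property_name` is hypothetical, and this is not the actual InfoboxExtractor code, which by all accounts does more than this.

```python
import re

def raw_property_name(template_key: str) -> str:
    """Approximate a raw property local name from an infobox key (illustrative only)."""
    # Drop a trailing numeric suffix such as the "2" in "nom de division2".
    without_suffix = re.sub(r"\d+$", "", template_key.strip())
    # Split on whitespace/underscores and camel-case the remaining words.
    words = [w for w in re.split(r"[\s_]+", without_suffix) if w]
    if not words:
        return ""
    return words[0] + "".join(w[0].upper() + w[1:] for w in words[1:])

keys = ["nom de division", "nom de division2", "nom de division3"]
# A set keeps only distinct names, so it shows how many properties survive.
names = {raw_property_name(k) for k in keys}
print(names)
```

Under these assumptions the set holds a single name, so the link between each "nom de divisionN" and its matching "divisionN" is no longer recoverable from the raw triples.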

Jan 22 '15 11:01 VladimirAlexiev

This is the default behavior to avoid multiple property definitions. Here is the code that cleans the URIs: https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala#L278-290

Feb 25 '15 07:02 jimkont

But they have different meanings. Similarly numbered props go in groups, so they should go into different IntermediateNodeMappings. If you collapse the props by wiping the numbers, won't such distinctions be lost?

Politician templates are the worst, eg see http://mappings.dbpedia.org/index.php?title=Mapping_bg:Държавник_инфо&action=edit.

You have a different prop group for every nested position>mandate (eg 3*5=15), and the grouping is both by prop prefix and suffix, and it's not consistent.

Eg there are 10 props "предшестван", all mapped to "predecessor" but in different groups:

 предшестван от
 предшестван от2
 предшестван от3
 втори_мандат_предшестван от
 втори_мандат_предшестван от2
 втори_мандат_предшестван от3
 трети_мандат_предшестван от
  ...

Which of them do you collapse, which do you leave alone, and why in each case?

Feb 25 '15 09:02 VladimirAlexiev

@jimkont Please reopen for further investigation

Feb 25 '15 10:02 VladimirAlexiev

This does not affect the mappings extractor & intermediate nodes, only the raw infobox extractor data. In some cases it might make sense to keep the numbers but there are also many where it does not. I don't have an example handy but last time I checked (2-3 years ago) there were quite a few.

So I am reopening, and I suggest we do this like the mappings case and examine the diffs from both options.

Feb 25 '15 10:02 jimkont

So when you write templateProperty=x, that may not really be dbprop:x but a modification thereof? I agree with these modifications: since the semantics of the raw props' "parasitic" prefixes & suffixes are not transmitted clearly, it's better to transmit them all as the "root" of the prop name.

Oh this needs to be documented in a whole chapter... my head hurts.

Two questions about the modification:

It drops suffix \d+
Shouldn't it also drop suffix \d+.* ?
Shouldn't there be a configurable list of droppable prefixes? Eg (първи|втори|трети) мандат

Feb 25 '15 10:02 VladimirAlexiev

There's a misunderstanding...

This issue only affects properties in the http://dbpedia.org/property/ namespace, produced by InfoboxExtractor.

The mappings produce properties in the http://dbpedia.org/ontology/ namespace. See MappingExtractor. Completely different code.

When you write templateProperty=x, you get exactly dbo:x. No modification. You never get dbprop:x - that's a completely different namespace.

Hope that clears things up. :-)

Feb 25 '15 11:02 jcsahnwaldt

http://wiki.dbpedia.org/Downloads2014#mapping-based-properties

http://wiki.dbpedia.org/Downloads2014#raw-infobox-properties

http://wiki.dbpedia.org/Datasets#h434-10

Feb 25 '15 11:02 jcsahnwaldt

P.S.: "When you write templateProperty=x, you get exactly dbo:x. No modification." - I think that's correct, but I'm not 100% sure, e.g. upper/lower case. Would have to check the code.

Feb 25 '15 11:02 jcsahnwaldt

More precisely: You won't get dbo:x. When you write templateProperty=x in a mapping, it matches exactly x in the Wikitext source, not x2 or anything else (check the code of SimplePropertyMapping.scala for details). Of course, you also specify ontologyProperty=y in the mapping, so you will get http://dbpedia.org/ontology/y, which is sometimes abbreviated as dbo:y.

Feb 25 '15 11:02 jcsahnwaldt

I think we can close this issue. It only affects the raw Infobox properties, which is basically a legacy dataset. In the last few years, we haven't put much work into it, and for good reason - the DBpedia wiki strongly recommends using the mapping based properties. They are much better.

Feb 25 '15 11:02 jcsahnwaldt

I think you misunderstood most of what I wrote above (not your brilliant self today ;-)). I grok that when I write in an IntermediateNodeMapping:

{{ PropertyMapping | templateProperty = втори_мандат_предшестван от3 | ontologyProperty = predecessor }}

It generates

bgdbr:Тодор_Живков bgdbp:вториМандатПредшестванОт <pred>. # raw: bgdbp: and dropped suffix
bgdbr:Тодор_Живков__1 dbo:predecessor <pred>. # mapped: dbo: and in IntermediateNode

BUT I'm pleading it should generate

bgdbr:Тодор_Живков bgdbp:предшестванОт <pred>.
bgdbr:Тодор_Живков__1 dbo:predecessor <pred>.

because the prefix втори_мандат_ is just as parasitic as the numeric suffix.

And also: the parasitic numeric-alphabetic suffix of предшестван от3a should also be dropped.

Documented at http://mappings.dbpedia.org/index.php/Rewriting_templateProperty. @jcsahnwaldt could you please take a look and see if it's accurate?

Don't throw away the raw props! They're there even if there are no mappings, or the mappings are wrong (alas, they are often wrong, even in big dbpedias like fr). So there are many real-world queries that mix raw and mapped props.

Feb 25 '15 12:02 VladimirAlexiev

Vladimir, please do not use the custom extractor when testing, this produces data from all available extractors (even infobox extractor). Use the default option, which extracts only labels and mappings (let me know if this is not the case).

Feb 25 '15 14:02 jimkont

@VladimirAlexiev I checked out http://mappings.dbpedia.org/index.php/Rewriting_templateProperty . Nice page! But it's largely... how do I say it nicely... well, it's just wrong. Now I know where this misunderstanding is coming from, and why you dared question my authority. ;-) I added these lines to the page:

Here's what actually happens:

Wikitext is parsed into an AST (abstract syntax tree)
The AST is passed to several different extractors (according to configuration)
Each extractor processes the AST and produces triples
The triples are not used as input for any other extractor

Here's what the InfoboxExtractor does:

data is extracted from template props in the AST
these are emitted as language-specific '''raw''' props, eg
- http://dbpedia.org/property/parent for EN (usual prefix [http://prefix.cc/dbp dbp:])
- http://bg.dbpedia.org/property/родител for BG (usual prefix [http://prefix.cc/bgdbp bgdbp:])

Here's what the MappingExtractor does:

data is extracted from template props in the AST and passed through mappings templateProperty -> ontologyProperty
these are emitted as generic mapping-based props, eg
- http://dbpedia.org/ontology/parent for EN, BG and any other language (usual prefix dbo:)

Feb 25 '15 18:02 jcsahnwaldt

In other words:

The InfoboxExtractor doesn't care about the mappings at all, processes all named properties and generates

bgdbr:Тодор_Живков bgdbp:вториМандатПредшестванОт <pred>. # raw: bgdbp: and dropped suffix

The MappingExtractor doesn't care at all about the property names produced by InfoboxExtractor, only extracts properties for which a mapping exists and generates

bgdbr:Тодор_Живков__1 dbo:predecessor <pred>. # mapped: dbo: and in IntermediateNode

The two are completely independent. (Well, they both process the AST produced by the Wikitext parser, but that's it.)
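
The independence of the two extractors can be sketched roughly as follows. This is a toy model, not the framework's code: the dictionary stands in for the AST, the function names, the mapping table, and the `роден` key are hypothetical, and intermediate nodes (the `__1` subject above) are deliberately left out.

```python
import re

# Hypothetical parsed template (stand-in for the AST): key -> value.
template = {
    "втори_мандат_предшестван от3": "<pred>",
    "роден": "<place>",  # made-up key with no mapping
}

# Hypothetical mapping table: exact template key -> ontology property.
mappings = {"втори_мандат_предшестван от3": "predecessor"}

def camel_case(key):
    words = [w for w in re.split(r"[\s_]+", key) if w]
    if not words:
        return ""
    return words[0] + "".join(w[0].upper() + w[1:] for w in words[1:])

def infobox_extractor(subject, props):
    """Emit a raw triple for every named property; never looks at mappings."""
    for key, value in props.items():
        local_name = camel_case(re.sub(r"\d+$", "", key))  # drop numeric suffix
        yield (subject, "bgdbp:" + local_name, value)

def mapping_extractor(subject, props, mapping_table):
    """Emit a triple only where a mapping matches the template key exactly."""
    for key, value in props.items():
        if key in mapping_table:
            yield (subject, "dbo:" + mapping_table[key], value)

subject = "bgdbr:Тодор_Живков"
raw = list(infobox_extractor(subject, template))
mapped = list(mapping_extractor(subject, template, mappings))
```

Neither function consumes the other's output. Changing how the raw local name is built would alter `raw` but leave `mapped` untouched, which is the point made above.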

Feb 25 '15 18:02 jcsahnwaldt

I hope that clears up things. This issue has nothing to do with mappings.

But you raised a few good questions about the InfoboxExtractor:

It drops suffix \d+

That's correct. It also does a few more things. See InfoboxExtractor.getPropertyUri for details.

Shouldn't it also drop suffix \d+.* ?

Probably not. I think there are some properties that contain a digit somewhere in the middle of their name. Something more specific would be better.
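
A small sketch of the concern, comparing the two patterns on a few keys. The key `co2 emissions` is made up for illustration; it is not taken from any real template.

```python
import re

TRAILING_DIGITS = re.compile(r"\d+$")     # the behaviour described in this thread
FROM_FIRST_DIGIT = re.compile(r"\d+.*")   # the broader rule being asked about

keys = ["division3", "предшестван от3a", "co2 emissions"]
results = {
    key: (TRAILING_DIGITS.sub("", key), FROM_FIRST_DIGIT.sub("", key))
    for key in keys
}
for key, (narrow, broad) in results.items():
    print(repr(key), "->", repr(narrow), "|", repr(broad))
```

The broad pattern handles the `3a` suffix that the narrow one misses, but it also truncates the made-up `co2 emissions` to `co`, which is why something more specific looks preferable.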

Shouldn't there be a configurable list of droppable prefixes? Eg (първи|втори|трети) мандат

Sounds good! Some config values in the class InfoboxExtractorConfig are already language specific. Might be relatively easy to add a few more such configuration values and use them in InfoboxExtractor.getPropertyUri.
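
One possible shape for such a setting, sketched in Python rather than in the framework's Scala. Everything here is an assumption: `DROPPABLE_PREFIXES` and `strip_parasitic_prefix` are hypothetical names, not fields or methods of the real InfoboxExtractorConfig, and the separator handling (space or underscore) is a guess based on the keys quoted earlier in this thread.

```python
import re

# Hypothetical per-language configuration of prefixes to drop from raw keys.
DROPPABLE_PREFIXES = {
    "bg": re.compile(r"^(първи|втори|трети)[ _]мандат[ _]"),
}

def strip_parasitic_prefix(key: str, lang: str) -> str:
    """Remove a configured leading prefix; leave the key alone otherwise."""
    pattern = DROPPABLE_PREFIXES.get(lang)
    if pattern is None:
        return key  # no prefixes configured for this language
    return pattern.sub("", key)

stripped = strip_parasitic_prefix("втори_мандат_предшестван от3", "bg")
print(stripped)
```

The numeric suffix is left in place on purpose, so that prefix and suffix handling stay separate steps that can be compared in the diffs independently.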

Feb 25 '15 18:02 jcsahnwaldt

@jcsahnwaldt Thanks for the edits! I'll fix up that page. @jimkont "please do not use the custom extractor when testing, this produces data from all available extractors (even infobox extractor)": is there any harm in that?

Mar 04 '15 09:03 VladimirAlexiev

@jimkont "please do not use the custom extractor when testing, this produces data from all available extractors (even infobox extractor)": is there any harm in that?

Yes, there is a limit on the extraction samples, and for big articles you might not get all the expected triples, just like with the Elvis Presley link you posted on the mailing lists. So if you want to test the mappings, use only the default extractor; if you want to see what else DBpedia would produce, use the custom one, but beware that it might not be complete.

Mar 04 '15 11:03 jimkont

Added your warning to http://mappings.dbpedia.org/index.php/Main_Page#Custom_or_Default_Extractor.

Now back to the topic: based on https://github.com/dbpedia/mappings-tracker/issues/51, I suggest also removing suffixes that are 1-2 digits followed by a single letter, i.e. match this: [^0-9][0-9]{1,2}[:alpha:]$

Mar 06 '15 13:03 VladimirAlexiev

I don't think that's what you want: it would also remove the character before the digits. Are you looking for a negative look-behind instead? I.e. (?<![0-9])[0-9]{1,2}[a-z]?$
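
The difference between the two patterns can be checked with a short sketch. Two assumptions: Python's `re` has no POSIX `[:alpha:]` class, so `[a-z]` stands in for it in the first pattern, and `area2010` is a made-up key used only to show what the look-behind protects.

```python
import re

# The first proposal: the leading [^0-9] is part of the match, so it gets removed too.
CONSUMING = re.compile(r"[^0-9][0-9]{1,2}[a-z]$")
# The look-behind variant: asserts "no digit before" without consuming anything.
LOOKBEHIND = re.compile(r"(?<![0-9])[0-9]{1,2}[a-z]?$")

keys = ["division3a", "предшестван от3a", "division12", "area2010"]
results = {
    key: (CONSUMING.sub("", key), LOOKBEHIND.sub("", key))
    for key in keys
}
for key, (consumed, kept) in results.items():
    print(repr(key), "->", repr(consumed), "|", repr(kept))
```

Note that the look-behind variant makes the trailing letter optional, so it also strips plain 1-2 digit suffixes, while longer digit runs such as the one in `area2010` are left untouched by both patterns.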

Anyway, I have always thought that the issue with "raw" props is more on the value side, since they're actually anything but raw. This just leads to newcomers seeing bugs everywhere just as in https://github.com/dbpedia/extraction-framework/issues/317. The problem is, throwing parsers blindly (as opposed to selecting the parser based on the ontology property type) at a prop does not always yield something meaningful.

That's why, when preparing for mapping, they may only give a shallow hint at the real template property content. And when trying to fix parser bugs, they are often of very little help...

Mar 06 '15 18:03 Nono314

Right about the regex. Did you write that guide? I think it's excellent; I added a bit and linked it to other pages I wrote. We need more best practices on specific topics (eg mapping Place Relations or Dimensions)

Mar 21 '15 11:03 VladimirAlexiev
