metafacture-core

Add SKOS lookup function in Fix

Open TobiasNx opened this issue 2 years ago • 33 comments

The Destatis-Fächerklassifikation vocabulary now has English prefLabels. To add them with Metamorph/Fix, we currently need a separate mapping file per language to get the prefLabels we want, like https://gitlab.com/oersi/oersi-etl/-/blob/master/data/maps/subject-labels.tsv For an English version we would need an additional list, which would also have to be maintained.

But since we have SkoHub Vocabs/SKOS TTL files, it would be nice to use them for the lookup so that we do not need to create and update additional lists.

The lookup should target a TTL file, e.g.: https://github.com/dini-ag-kim/hochschulfaechersystematik/blob/master/hochschulfaechersystematik.ttl (Other SKOS serializations could follow.)

Something like the following would be nice (mock code):


@base <https://w3id.org/kim/hochschulfaechersystematik/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix schema: <http://schema.org/> .
@prefix vann: <http://purl.org/vocab/vann/> .

...

<n4> a skos:Concept ;
  skos:prefLabel "Mathematik, Naturwissenschaften"@de, "Mathematics, Natural Sciences"@en ;
  skos:narrower   <n36>, <n37>, <n39>, <n40>, <n41>, <n42>, <n43>, <n44> ;
  skos:notation "4" ;
  skos:topConceptOf <scheme> .

...

Idea for Fix function:

skos_lookup("element-path" ,file="[path/url]", 
[match="attribute that should be matching", matchLanguage="language of the replaced value"], 
target="attribute to be replace with", targetLanguage="language of replacing value")

file= could be a URL or a local file; match= defaults to id; match= and matchLanguage= are optional; target= and targetLanguage= are always needed.

Use case 1:

Find matching subject and return object of targeted predicate.

in: https://w3id.org/kim/hochschulfaechersystematik/n4

skos_lookup("path", file="https://raw.githubusercontent.com/dini-ag-kim/hochschulfaechersystematik/master/hochschulfaechersystematik.ttl", target="prefLabel", targetLanguage="de")

out: Mathematik, Naturwissenschaften

Use case 2:

Find matching object value in selected predicate and return its subject.

in: Mathematics, Natural Sciences

skos_lookup("path", file="https://raw.githubusercontent.com/dini-ag-kim/hochschulfaechersystematik/master/hochschulfaechersystematik.ttl",match="prefLabel", matchLanguage="en", target="id")

out: https://w3id.org/kim/hochschulfaechersystematik/n4

Use case 3:

Find matching object value in selected predicate and return object of targeted and connected predicate. This could be also interesting if we have SKOS files with hiddenLabels or altLabels.

in: Mathematics, Natural Sciences

skos_lookup("path", file="https://raw.githubusercontent.com/dini-ag-kim/hochschulfaechersystematik/master/hochschulfaechersystematik.ttl", match="prefLabel", matchLanguage="en", target="prefLabel", targetLanguage="de")

out: Mathematik, Naturwissenschaften
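The three use cases above can be sketched in a few lines of Python. This is only an illustration of the intended semantics, not the Fix implementation: the function name and parameters mirror the mock code, the triples are a hand-written stand-in for the parsed TTL file, and the short property names ("prefLabel", "id") are a simplification.

```python
# Hypothetical in-memory stand-in for the parsed TTL file:
# (subject, predicate, object, language tag or None).
BASE = "https://w3id.org/kim/hochschulfaechersystematik/"
TRIPLES = [
    (BASE + "n4", "prefLabel", "Mathematik, Naturwissenschaften", "de"),
    (BASE + "n4", "prefLabel", "Mathematics, Natural Sciences", "en"),
    (BASE + "n4", "notation", "4", None),
]


def skos_lookup(value, triples, target, target_language=None,
                match="id", match_language=None):
    """Return the first value found for `target`, or None if nothing matches."""
    if match == "id":
        # The input value is already the subject URI.
        subjects = [value]
    else:
        # Find subjects whose `match` property carries the input value.
        subjects = [s for s, p, o, lang in triples
                    if p == match and o == value
                    and (match_language is None or lang == match_language)]
    for subject in subjects:
        if target == "id":
            return subject
        for s, p, o, lang in triples:
            if (s == subject and p == target
                    and (target_language is None or lang == target_language)):
                return o
    return None


# Use case 1: subject -> property
label_de = skos_lookup(BASE + "n4", TRIPLES,
                       target="prefLabel", target_language="de")
# Use case 2: property -> subject
concept_id = skos_lookup("Mathematics, Natural Sciences", TRIPLES,
                         match="prefLabel", match_language="en", target="id")
# Use case 3: property -> property
translated = skos_lookup("Mathematics, Natural Sciences", TRIPLES,
                         match="prefLabel", match_language="en",
                         target="prefLabel", target_language="de")
```

Returning None for a miss is a choice made for this sketch; what the Fix function should do with unmatched values is a separate question.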

Code review: @fsteeg Functional review: @TobiasNx @acka47

TobiasNx avatar Nov 03 '21 14:11 TobiasNx

In today's meeting we decided to:

  1. work on a concrete use case (from RPB, @acka47 will provide it)
  2. focus on an implementation for Metafix
  3. support different RDF serializations
  4. take into account that we would have to address the considerations below at a later point without breaking the implementation

Further considerations:

  • What about concept schemes that are not available as single files, where you have to traverse the graph from ConceptScheme over hasTopConcept and narrower to get all the data for the lookup function? (E.g. SkoHub Vocabs doesn't require you to have the ConceptScheme in one file; you could also have one file per Concept.)
  • For use cases I know from @sroertgen one would need to specify different fields (e.g. prefLabel, altLabel, hiddenLabel) to be taken into account for matching.

acka47 avatar Jun 02 '22 11:06 acka47

As required, here is my use case.

In RPB data we only have notations for RPB subjects, e.g. #30 _sn584060_[/]#30a_sn584070_, see here.

I can create the correct concept URI with Fix, resulting in:

{
   "subject":[
      {
         "id":"http://purl.org/lobid/rpb#n584060",
         "label":"Platzhalter Schlagwortlabel",
         "type":[
            "Concept"
         ],
         "source":{
            "id":"http://purl.org/lobid/rpb",
            "label":"Systematik der Rheinland-Pfälzischen Bibliographie"
         }
      },
      {
         "id":"http://purl.org/lobid/rpb#n584070",
         "label":"Platzhalter Schlagwortlabel",
         "type":[
            "Concept"
         ],
         "source":{
            "id":"http://purl.org/lobid/rpb",
            "label":"Systematik der Rheinland-Pfälzischen Bibliographie"
         }
      }
   ]
}

As you can see, I added a generic "Platzhalter Schlagwortlabel" (placeholder subject label) as the label for now, since I cannot (yet) look up labels in a SKOS file. In the future, I'd be happy to do something like this in the Fix:

add_field("label", lookup:"prefLabel@de", in:"http://purl.org/lobid/rpb", basedOn:"id", match:"${s}")

Where I basically specify what content should be added to the new "label" field by indicating:

  • which string to add, here the skos:prefLabel of the matched resource with language code "de"
  • the ConceptScheme to do the lookup in, here http://purl.org/lobid/rpb
  • the source field from my data to do the lookup with, here id
  • the field(s) to look for a match, here it is the subject URI in the RDF: ${s}

acka47 avatar Jun 02 '22 14:06 acka47

add_field("label", lookup:"prefLabel@de", in:"http://purl.org/lobid/rpb", basedOn:"id", match:"${s}")

In general, I think we should implement this like the existing lookup, so something like:

lookup("subject.label", "rpb.ttl", someOption: ..., anotherOption2: ...)

fsteeg avatar Jun 02 '22 14:06 fsteeg

In general, I think we should implement this like the existing lookup, so something like:

lookup("subject.label", "rpb.ttl", someOption: ..., anotherOption2: ...)

Are you sure that lookup() should be overloaded? Catmandu has lookup_in_store(), so I'd suggest either modeling it as a store and implementing that function or just naming it accordingly (lookup_in_rdf()). Otherwise, rdf_lookup() might be an acceptable name.

blackwinter avatar Jun 02 '22 14:06 blackwinter

My view would be that essentially, we want to support one additional file format, TTL, in addition to CSV and TSV.

Since we'd probably implement this based on an RDF model anyway, we might as well support other SKOS RDF serializations (though I'm not even sure I like that idea, I'd prefer to stick to actual use cases, and we use TTL files). But a generic RDF lookup would be quite a different thing. For that, something like lookup_in_store (and then using something like a triple store) might make more sense.

fsteeg avatar Jun 02 '22 15:06 fsteeg

My view would be that essentially, we want to support one additional file format, TTL, in addition to CSV and TSV.

In principle, yes, but lookup() is specifically meant for dictionaries. And an RDF file, whatever its serialization, is conceptually quite different from a simple delimited file with key-value pairs.

But I'm unsure myself. I just think we might regret it if we overwhelmed lookup() with too many features.

[Come to think of it, maybe we shouldn't even have added local maps to it. lookup_in_map() or lookup_in_store(..., Memory) might have been more appropriate.]

blackwinter avatar Jun 02 '22 15:06 blackwinter

I agree with @blackwinter: while it's in principle possible to make a Map out of RDF files, it may get complicated. And since there are Further considerations, it may be better to go with an RDF store from the beginning. As I am not a fan of external databases (they bring complexity) and our scenarios use only little data, I would start with an in-memory RDF store/model.

dr0i avatar Jun 02 '22 15:06 dr0i

I don't think it helps to talk about RDF here. Spreadsheets are also much more powerful than simple dictionary lookups, yet we don't have generic spreadsheet support, we only use TSV or CSV files as simple dictionaries. Same is our plan for SKOS as I understand it: we want to use it as a simple dictionary.

fsteeg avatar Jun 03 '22 09:06 fsteeg

Hm, but if you look at the scenarios @TobiasNx provided, these are not simple dictionaries, are they? I mean, yes, you can somehow break everything down to key-value structures, but they may not fit all purposes, e.g. "give me A, but A shall not have B and must be of Concept C". See also Semantic Reasoner. I mean, it's about SKOS lookup, so naturally RDF? Maybe you can tell us what your problem with RDF is? One obvious drawback is the need for heavy dependencies (going with Apache Jena), which on the other hand provides parsing of all kinds of RDF serializations, merging and querying in an easy and standardized way. So maybe we could provide some kind of modules in metafacture-fix, like in metafacture-core?

dr0i avatar Jun 03 '22 10:06 dr0i

Maybe you can tell us what your problem with RDF is? One obvious drawback is the need for heavy dependencies (going with Apache Jena)

No problem with RDF, and I even imagined implementing this based on an RDF model, using Jena. My point is how this will be used. I think it should provide a simple way to look up values in a SKOS TTL file instead of a TSV or CSV. It should not require dealing with RDF concepts. Something like lookup(field, 'rpb.ttl') should provide a prefLabel for a concept ID. Lookup options as described above could be configured, but I think it should be that simple to use for the basic use case.

Another option, from my point of view, would be to add support for reading RDF data in Metafacture. We could then write a small 'preprocessing' workflow that transforms the RDF data into a lookup TSV and use that, instead of adding lookup support for SKOS.
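To make the preprocessing idea concrete, here is a minimal Python sketch that turns already-parsed triples into a two-column lookup TSV for one language. The function name `triples_to_tsv` and the tuple layout are assumptions for illustration; in a real workflow the triples would come from an RDF parser, and one TSV would be needed per language.

```python
import csv
import io

SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"

# Hypothetical parsed triples: (subject, predicate, object, language tag).
TRIPLES = [
    ("https://w3id.org/kim/hochschulfaechersystematik/n4", SKOS_PREF_LABEL,
     "Mathematik, Naturwissenschaften", "de"),
    ("https://w3id.org/kim/hochschulfaechersystematik/n4", SKOS_PREF_LABEL,
     "Mathematics, Natural Sciences", "en"),
]


def triples_to_tsv(triples, language):
    """Write one 'concept URI <TAB> prefLabel' row per label in `language`."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for subject, predicate, label, lang in triples:
        if predicate == SKOS_PREF_LABEL and lang == language:
            writer.writerow([subject, label])
    return out.getvalue()


tsv_de = triples_to_tsv(TRIPLES, "de")
tsv_en = triples_to_tsv(TRIPLES, "en")
```

The resulting text could be saved and used with the existing TSV-based lookup, at the cost of regenerating the files whenever the vocabulary changes.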

fsteeg avatar Jun 03 '22 12:06 fsteeg

I want to point out one advantage of a genuine SKOS lookup: we can use one TTL file for multiple lookups instead of generating multiple TSV files, be it automatically or manually.

In OERSI we have e.g.:

lookup("learningResourceType[].*.prefLabel.de", "data/maps/hcrt-de-labels.tsv","sep_char":"\t", delete:"true")
lookup("learningResourceType[].*.prefLabel.en", "data/maps/hcrt-en-labels.tsv","sep_char":"\t", delete:"true")

with some kind of SKOS-lookup this could be:


skos_lookup("learningResourceType[].*.prefLabel.de", file="https://raw.githubusercontent.com/dini-ag-kim/hochschulfaechersystematik/master/hochschulfaechersystematik.ttl", target="prefLabel", targetLanguage="de" )
skos_lookup("learningResourceType[].*.prefLabel.en", file="https://raw.githubusercontent.com/dini-ag-kim/hochschulfaechersystematik/master/hochschulfaechersystematik.ttl", target="prefLabel", targetLanguage="en" )

TobiasNx avatar Jun 03 '22 12:06 TobiasNx

I want to point out one advantage of a genuine SKOS lookup: we can use one TTL file for multiple lookups instead of generating multiple TSV files, be it automatically or manually.

I agree with this statement as long as you say "one ConceptScheme" instead of "one ttl-file". As noted before, even with SkoHub Vocabs one Concept scheme can be spread over many files, which totally makes sense when you have a big vocab.

acka47 avatar Jun 03 '22 12:06 acka47

I have updated the initial post so that the functions are now written in Fix: https://github.com/metafacture/metafacture-core/issues/415#issue-1043693019

I also outlined the Fix function I had in mind to solve this:

skos_lookup("element-path" ,file="[path/url]", 
[match="attribute that should be matching", matchLanguage="language of the replaced value"], 
target="attribute to be replace with", targetLanguage="language of replacing value")

file= could be a URL or a local file; match= defaults to id; match= and matchLanguage= are optional; target= and targetLanguage= are always needed.

TobiasNx avatar Jun 03 '22 12:06 TobiasNx

target= and targetLanguage= are always needed

Wouldn't it make sense to use prefLabel as a default, and make target optional?

And are language tags required in SKOS? Even if they were, I think having a default like 'If there is only one language, use that if no target language is given' would be nice.
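The proposed default ("if there is only one language, use that if no target language is given") could look roughly like the following Python sketch. The helper name `pick_label` and the (text, language) pair layout are assumptions; returning None when several languages exist and none was requested is one possible choice, not a decision made in this thread.

```python
def pick_label(labels, target_language=None):
    """labels: list of (text, language) pairs; language is None for untagged literals."""
    if target_language is not None:
        # An explicit language was requested: only exact tag matches count.
        matches = [text for text, lang in labels if lang == target_language]
        return matches[0] if matches else None
    # No language requested: fall back only if the choice is unambiguous.
    languages = {lang for _, lang in labels}
    if len(languages) == 1:
        return labels[0][0]
    return None


bilingual = [("Mathematik, Naturwissenschaften", "de"),
             ("Mathematics, Natural Sciences", "en")]
untagged = [("love", None)]
```

With this rule, `pick_label(untagged)` still yields a label even though the literal carries no language tag, while `pick_label(bilingual)` without a language stays undecided.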

fsteeg avatar Jun 07 '22 06:06 fsteeg

And are language tags required in SKOS? Even if they were, I think having a default like 'If there is only one language, use that if no target language is given' would be nice.

Language tags in SKOS are optional:

As specified in Section 5 of the SKOS Reference, skos:prefLabel, skos:altLabel and skos:hiddenLabel provide simple labels. They are all sub-properties of rdfs:label, and are used to link a skos:Concept to an RDF plain literal, which is a character string (e.g. "love") combined with an optional language tag (e.g. "en-US") [RDF-CONCEPTS].

source: https://www.w3.org/TR/2009/NOTE-skos-primer-20090818/#seclabel

sroertgen avatar Jun 10 '22 09:06 sroertgen

@sroertgen as you are here: I know you extensively use SKOS files for normalizing data in an ETL process. Are your use cases addressed in this issue or do you see something we should keep in mind?

acka47 avatar Jun 10 '22 10:06 acka47

Yes, that is pretty much what we did in WLO. We used prefLabel, altLabel and hiddenLabel for data normalization and then assigned the id of the respective matching concept.
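A minimal Python sketch of this kind of normalization: build one index over several label properties and map any matching label to the concept id. The altLabel and hiddenLabel values below are invented for illustration, and the case-insensitive comparison is an assumption, not something described for WLO.

```python
BASE = "https://w3id.org/kim/hochschulfaechersystematik/"

# (subject, predicate, object, language); alt/hidden labels are made up.
TRIPLES = [
    (BASE + "n4", "prefLabel", "Mathematics, Natural Sciences", "en"),
    (BASE + "n4", "altLabel", "Maths and Science", "en"),       # hypothetical
    (BASE + "n4", "hiddenLabel", "Mathmatics", "en"),           # hypothetical
    (BASE + "n4", "notation", "4", None),
]


def build_label_index(triples,
                      match_properties=("prefLabel", "altLabel", "hiddenLabel")):
    """Map every label of the given properties to its concept id."""
    index = {}
    for subject, predicate, label, _lang in triples:
        if predicate in match_properties:
            # First concept wins if two concepts share a label.
            index.setdefault(label.casefold(), subject)
    return index


def normalize(value, index):
    """Return the concept id for a label, or None if the label is unknown."""
    return index.get(value.casefold())


INDEX = build_label_index(TRIPLES)
```

Restricting `match_properties` to a single property gives the narrower matching discussed for the lookup function.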

sroertgen avatar Jun 13 '22 12:06 sroertgen

I don't get how lookup_rdf is intended to work when I look at the current documentation. I want to use the URIs from the field id for the lookup and write the skos:prefLabel that comes back from the lookup into label. How should I express this?

Something like this wouldn't work: add_field("label", lookup_rdf("id", "rpb.ttl", target: "http://www.w3.org/2004/02/skos/core#prefLabel"))

Show me the way...

acka47 avatar Aug 26 '22 11:08 acka47

With the help of @TobiasNx, I think I now understand how it is intended to work. I did the following:

copy_field("id", "label")
lookup_rdf("label", "rpb.ttl", target: "http://www.w3.org/2004/02/skos/core#prefLabel", target_language: "de")

and (without the target_language):

    copy_field("id", "label")
    lookup_rdf("label", "https://raw.githubusercontent.com/hbz/lobid-vocabs/master/rpb/rpb.ttl", target: "http://www.w3.org/2004/02/skos/core#prefLabel")

However, in the output I don't get the prefLabel but:

    "id" : "http://purl.org/lobid/rpb#n142560",
    "label" : "__default",

I double-checked and the looked-up URI definitely exists in the SKOS file along with a prefLabel. I also tried:

copy_field("id", "label")
lookup_rdf("label", "https://raw.githubusercontent.com/hbz/lobid-vocabs/master/rpb/rpb.ttl", target: "http://www.w3.org/2004/02/skos/core#prefLabel", target_language: "de")

acka47 avatar Aug 26 '22 12:08 acka47

See the branch I am trying things out at https://github.com/hbz/rpb/tree/rpb-11 and the commit at https://github.com/hbz/rpb/commit/ea6d31849992a75c55ebf378d17bc1b51abd2e83

acka47 avatar Aug 26 '22 12:08 acka47

You mean "Use case 1". There is a test scenario for this.

This seems to be a bug when using the whole URI. Using the namespace prefix 'skos:prefLabel' instead of the URI works. Going to fix this.

dr0i avatar Aug 26 '22 14:08 dr0i

Thanks for the pointer. So I tested it with skos:prefLabel instead of the URI and it works with both these scenarios.

  1. ✅ Use a local file for lookup:
    copy_field("id", "label")
    lookup_rdf("label", "rpb.ttl", target: "skos:prefLabel")
  2. ✅ Use a remote file directly pointing to it:
    copy_field("id", "label")
    lookup_rdf("label", "https://raw.githubusercontent.com/hbz/lobid-vocabs/master/rpb/rpb.ttl", target: "skos:prefLabel")

❌ However, the third, very common use case does not work when I use the actual vocabulary URI that redirects to the file URL:

    copy_field("id", "label")
    lookup_rdf("label", "http://purl.org/lobid/rpb", target: "skos:prefLabel")

I think redirects should be supported, so two to-dos remain:

  • [x] Support using the absolute property URI (e.g. http://www.w3.org/2004/02/skos/core#prefLabel) as target.
  • [x] Support redirects when indicating the URI of a RDF file.
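For the first to-do, one way to accept both spellings of a property is to expand prefixed names to absolute URIs before comparing. The Python sketch below illustrates the idea only; `expand_property` and the prefix table are hypothetical and say nothing about how the Jena-based implementation resolves names.

```python
# Known namespace prefixes (only skos is needed for the examples in this thread).
PREFIXES = {"skos": "http://www.w3.org/2004/02/skos/core#"}


def expand_property(name, prefixes=PREFIXES):
    """Return the absolute URI for 'prefix:local' names; pass absolute URIs through."""
    if "://" in name:
        return name
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    raise ValueError(f"Unknown prefix in {name!r}")


short_form = expand_property("skos:prefLabel")
long_form = expand_property("http://www.w3.org/2004/02/skos/core#prefLabel")
```

After expansion, both forms of the target compare equal, so the lookup itself only has to deal with absolute URIs.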

acka47 avatar Aug 29 '22 06:08 acka47

We just discussed this in the standup: We might only be working with relatively small SKOS vocabs, but there exist much bigger ones, e.g. AgroVoc with a ~850MB NT file (see here).

We might test it ourselves with a bigger vocab, or at least should add to the documentation how much memory you need for what kind of file sizes.

acka47 avatar Sep 06 '22 10:09 acka47

  • implemented redirects (as discussed offline - this would not be necessary if we could use the newest jena-core library, but we would need to update MF to java 11 first).
  • fixed using URIs (no namespaces)

dr0i avatar Sep 06 '22 14:09 dr0i

Re AgroVoc and memory consumption: I tested looking up one value in the ~850MB NT file. This consumed ~4.5 GB RAM.

The functionality is programmed like this: load the RDF into a Jena model, make lookups (defined by Metafix) and store successful lookups in a HashMap for fast caching. Note that the more values are successfully looked up, and thus cached, the more RAM will be consumed. In such a case one could add an option to reduce RAM consumption by avoiding caching at the expense of CPU (I assume this will be slower, but I don't know exactly how much, if at all).

One might roughly say: RAM consumption is at least "$sizeOfNtFile * 5 + $amountOfSuccessfullyLookups * ($UriBytes + $ValueBytes + 100)", where "100" is the number of bytes needed for one simple HashMap entry (don't nail me on that). So if you had 100k different AgroVoc lookups for "notation", and assumed that $UriBytes + $ValueBytes is about 110 bytes, this would make 4.5GB + (100k * 210 bytes) = 4.5GB + 21MB, which would be very nice (even if it were 210MB instead of 21MB, you wouldn't notice).

Conclusion: memory consumption depends on various factors, but one should be able to use AgroVoc with a decent number of lookups on a normal desktop PC. I would recommend discussing this and possible solutions if the need arises (as always).
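The rough estimate above can be written out as a small Python function. The factor 5 and the 100 bytes per HashMap entry are the rule-of-thumb values from this comment, not measured constants, and the function name is made up for illustration.

```python
def estimate_ram_bytes(nt_file_bytes, successful_lookups,
                       uri_and_value_bytes, entry_overhead_bytes=100):
    """Lower-bound estimate: in-memory model plus cache of successful lookups."""
    model = nt_file_bytes * 5  # rule of thumb from the comment above
    cache = successful_lookups * (uri_and_value_bytes + entry_overhead_bytes)
    return model + cache


# AgroVoc example: ~850MB NT file, 100k cached lookups of ~110 bytes each.
model_only = estimate_ram_bytes(850_000_000, 0, 110)
with_cache = estimate_ram_bytes(850_000_000, 100_000, 110)
cache_share = with_cache - model_only
```

For the AgroVoc numbers the factor of 5 gives about 4.25GB for the model, in the same range as the ~4.5GB observed, and the cache adds 21MB on top.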

dr0i avatar Sep 06 '22 16:09 dr0i

I tested this once again and both lookup by absolute property URI and loading vocab file through a 301 redirect now work. Concretely, I tried this with success:

copy_field("id", "label")
lookup_rdf("label", "http://purl.org/lobid/rpb", target: "http://www.w3.org/2004/02/skos/core#prefLabel", target_language: "de")

Thus, +1 from my side.

acka47 avatar Sep 09 '22 07:09 acka47

https://github.com/metafacture/metafacture-fix/commit/f2f26654dc946a10cdd63b1f70b9a24edf79e1c0

I added integration tests to specify my use cases. It seems that only Subject -> Property (Use case 1) works as a transformation, but its default behaviour differs from other lookups: it provides default values if the value is not found. This should be optional.

Property -> Subject (Use case 2) does not seem to work, or I did something wrong. Defined Property -> Subject, a variation of Use case 2, also does not work: https://github.com/metafacture/metafacture-fix/blob/f2f26654dc946a10cdd63b1f70b9a24edf79e1c0/metafix/src/test/resources/org/metafacture/metafix/integration/lookup/fromJson/toJson/lookupRdfDefinedPropertyToSubject

Property -> Property (Use case 3) does not seem to work, or I did something wrong. Defined Property -> Property, a variation of Use case 3, also does not work: https://github.com/metafacture/metafacture-fix/blob/f2f26654dc946a10cdd63b1f70b9a24edf79e1c0/metafix/src/test/resources/org/metafacture/metafix/integration/lookup/fromJson/toJson/lookupRdfDefinedPropertyToProperty

@dr0i does this help?

TobiasNx avatar Sep 29 '22 16:09 TobiasNx

Fixed a bug and adapted some provided Fix so more of your use cases work now (see https://github.com/metafacture/metafacture-fix/pull/229). I am not sure what the difference between lookupRdfDefinedPropertyToProperty and lookupRdfDefinedPropertyToSubject is (I don't see any).

The lookupRdfDefinedPropertyToSubject cannot work as it is; a new parameter would be needed for this. The input:

{
  name : Jake,
  a : Softwareanwendung
}
[...]
{
  name : Noone_2,
  a : Assessment
}

As intended, this substitutes Softwareanwendung with https://w3id.org/kim/hcrt/application. However, Assessment does not have the language tag de, so it tries to fall back to getting the German prefLabel, i.e. Lernkontrolle. While I think this is rather an edge case (who would ever want to get URIs restricted to the language tags of the objects of the property? If you omitted the language tag, you would get the URI https://w3id.org/kim/hcrt/assessment), this might be a valid scenario. As mentioned, this would only be possible if a new parameter is introduced, e.g. get: "subject" (with "object" as the alternative). Are you all right with that new parameter, @TobiasNx? (I would like to make that parameter mandatory.)

dr0i avatar Oct 17 '22 12:10 dr0i

@dr0i Did you have a look at the expected file?

https://github.com/metafacture/metafacture-fix/commit/678ca6209370ff75baa9ede62af46d43d36c7b38#diff-dd875d4ae1fe830bf0266806e23c55bd79ce68b7cbefbdf7b3dab02d57807627

It is intended that Assessment does not match! This is the difference between arbitrary key matching, which matches as long as the value occurs anywhere (this is what the current SKOS lookup provides), and defined key matching, where I determine which subject or property in the SKOS file is the key element. I intend to narrow the matching key and not compare all properties of a SKOS concept.

Input:

{
  "name" : "Jake",
  "a" : "Softwareanwendung"
}
{
  "name" : "Blacky",
  "a" : "Nachschlagewerk"
}
{
  "name" : "Noone",
  "a" : "cat"
}
{
  "name" : "Noone_2",
  "a" : "Assessment"
}

Expected:

{
  "name" : "Jake",
  "a" : "https://w3id.org/kim/hcrt/application"
}
{
  "name" : "Blacky",
  "a" : "https://w3id.org/kim/hcrt/index"
}
{
  "name" : "Noone",
  "a" : "cat"
}
{
  "name" : "Noone_2",
  "a" : "Assessment"
}

The difference between lookupRdfDefinedPropertyToProperty and lookupRdfDefinedPropertyToSubject is the difference between Use case 3 and Use case 2
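The expected file above can be reproduced with a short Python sketch of "defined key" matching for the Property -> Subject case. The triples are a hand-written excerpt of the hcrt vocabulary based on the labels mentioned in this thread; the "en" tag on "Assessment" is an assumption (the thread only says it is not tagged "de"), and unmatched values are passed through unchanged as in the expected output.

```python
HCRT = "https://w3id.org/kim/hcrt/"

# (subject, predicate, object, language)
TRIPLES = [
    (HCRT + "application", "prefLabel", "Softwareanwendung", "de"),
    (HCRT + "index", "prefLabel", "Nachschlagewerk", "de"),
    (HCRT + "assessment", "prefLabel", "Lernkontrolle", "de"),
    (HCRT + "assessment", "prefLabel", "Assessment", "en"),  # "en" is assumed
]


def to_subject(value, triples, match, match_language):
    """Replace a label with its concept URI, but only if it matches the defined key."""
    for subject, predicate, label, lang in triples:
        if predicate == match and lang == match_language and label == value:
            return subject
    return value  # no match on the defined key: keep the input unchanged


records = [
    {"name": "Jake", "a": "Softwareanwendung"},
    {"name": "Blacky", "a": "Nachschlagewerk"},
    {"name": "Noone", "a": "cat"},
    {"name": "Noone_2", "a": "Assessment"},
]

result = [dict(r, a=to_subject(r["a"], TRIPLES, "prefLabel", "de"))
          for r in records]
```

Because the key is restricted to the German prefLabel, "Assessment" stays untouched even though the concept exists in the vocabulary.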

TobiasNx avatar Oct 17 '22 13:10 TobiasNx

While I think this is rather an edge case (who would ever want to get URIs restricted to the language tags of the objects of the property? If you omitted the language tag, you would get the URI https://w3id.org/kim/hcrt/assessment), this might be a valid scenario. As mentioned, this would only be possible if a new parameter is introduced, e.g. get: "subject" (with "object" as the alternative). Are you all right with that new parameter, @TobiasNx? (I would like to make that parameter mandatory.)

The options were what I had in mind when I created the first mock code for this function: match="prefLabel", matchLanguage="en":

skos_lookup("path", file="https://raw.githubusercontent.com/dini-ag-kim/hochschulfaechersystematik/master/hochschulfaechersystematik.ttl", match="prefLabel", matchLanguage="en", target="prefLabel", targetLanguage="de")

I tell the function WHERE to look with (match="prefLabel", matchLanguage="en") instead of telling the function just to look if anything is matching.

TobiasNx avatar Oct 17 '22 14:10 TobiasNx