Question about IRI directives
I was looking at documenting some of the directives (mm:CamelCaseEncode etc.) and discovered I was unsure about the use of the directives (apart from mm:IRI).
https://github.com/protegeproject/mapping-master/wiki/MappingMasterDSL#iris
Apart from the mm:IRI directive, can these directives be used for generating OWL entity names?
What are the rules for camel and snake case encoding? Which characters are removed when generating a value?
As per our other discussion (#70), in the case of mm:IRI, mm:CamelCaseEncode, and mm:SnakeCaseEncode, is the raw value first tested to see if it is a valid IRI and, if not, is the base ontology namespace prepended to it?
After looking a bit more, I think the directives are for IRI encoding only.
If so, should we rename them to indicate this fact, e.g., mm:IRICamelCaseEncode, mm:IRISnakeCaseEncode, mm:IRIHashEncode etc?
If they are used only for IRIs then I think the names should reflect that.
We can do that but I think it's not necessary. The mm:...Encode is used to preprocess the input string to create a valid local name for the IRI.
The role of mm:IRI is pretty unique though. In the grammar spec it is defined as one of the ReferenceType. It is not part of the Declaration symbol (like CLASS or INDIVIDUAL) that you can use to create an entity. Instead it is used to set which form of OWL object will be applied to the reference's value.
Furthermore, I'd like to propose to rename the mm:IRI as IRI to convey the notion "to represent an OWLObject" and be consistent with the other symbols, such that in the grammar it becomes:
TOKEN: { <OWL_IRI: "IRI" > }
Also it is unique because (cmiiw) it only becomes visible for defining an IRI-typed annotation value. All entities are implicitly named using IRI.
I don't think an IRI necessarily represents an OWL entity. In the OWL spec, an IRI is just an IRI. It does not have to refer to an OWL entity.
However, as you said, mm:IRI is special in that it is declaring a type for a reference (as with Individual, xsd:int etc.) so one could imagine removing the preceding mm: for consistency. @csnyulas, what do you think?
I would recommend adding the IRI to the other IRI directives though to make their use explicit. For example, a user could easily type
@A3(rdfs:label=mm:CamelCaseEncode)
which I don't think makes sense.
I thought about this quite a lot, and I had also a discussion with @johardi and here is a summary of my thoughts:
First of all I think that it does indeed make sense to remove the mm: prefix from mm:IRI, as IRI represents the type of the reference, similarly to Individual or DataProperty (things pointed out also by @martinjoconnor and @johardi ), and not a directive or function that allows manipulation of the cell value, as it is with the other keywords starting with mm:. I remember though that there was (something that seemed a good) reason for adding the mm: prefix, when we decided for it last year, but I can't recall all the details.
This also brings up the question of whether the newly proposed mm:EntityIRI makes sense, and whether it needs some reconsideration. I would say that it does, as it is an MM function/directive, but it is in a way special, because it specifies both the type of the reference (IRI) and how to generate the value of that IRI, namely from an entity. Also we need to revisit other MM keywords, to see if it wouldn't be better to use the mm:entityIRI spelling.
I also second @martinjoconnor 's comments of not using OWL_IRI in the grammar
TOKEN: { <OWL_IRI: "IRI" > }
but rather just IRI (or at most something that would refer to the fact that it is a type of a reference):
TOKEN: { <IRI: "IRI" > }
I don't think it is necessary to add the IRI to the IRI encoding directives (mm:camelCaseEncode, mm:hashEncode, etc.). Those are functions, whose purpose should be clear, if properly documented. I can see that having IRI in the name would make their purpose and usage even more clear, but it would also make them longer, and therefore harder to remember and type, and would break existing rules (see my comments on this at the end).
I also think that the IRI encoding directives should be documented independently from the IRI type (i.e. in a separate table or section of the MM DSL wiki page)
We should keep our non-backward compatible changes to the minimum, so that we don't break our users' existing mappings too badly. I.e. keep to things that they can easily and quickly fix, so that they won't get frustrated. :)
I agree that the mm: prefix should be removed in this case. However, there should be a high threshold for such removal because these raw terms now become keywords in the language. In this case one could no longer use 'IRI' as the name of an OWL entity - and existing ontologies that did would not work with MM. (One could write the language processor to allow keywords as entity names but it would massively complicate the processor. There is a good reason that most languages do not allow keywords as variable names.) MM was careful to prefix mm: to all language directives. (And the ability to change prefixes could be used to work around the cases where collisions with existing entity names could occur.)
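To make the keyword concern concrete, here is a minimal sketch of how a name that was previously free becomes reserved once the mm: prefix is dropped. The `classify` function and both reserved-word sets are hypothetical illustrations, not MM's actual lexer.

```python
# Hypothetical reserved-word sets; MM's real keyword list is larger.
RESERVED_BEFORE = {"Class", "Individual"}      # mm:IRI is prefixed, so bare "IRI" is free
RESERVED_AFTER = RESERVED_BEFORE | {"IRI"}     # proposed change: bare IRI becomes a keyword


def classify(token, reserved):
    """Return 'KEYWORD' if the token is reserved, otherwise 'NAME'."""
    return "KEYWORD" if token in reserved else "NAME"


# An ontology with an entity called "IRI" parses as a name before the change
# and collides with the keyword afterwards.
print(classify("IRI", RESERVED_BEFORE))
print(classify("IRI", RESERVED_AFTER))
```

Under this reading, any existing mapping that refers to an entity literally named IRI would need to be rewritten after the change.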
mm:EntityIRI is indeed problematic here, however. We should think about it a bit more. The most MM-conformant approach here would be to use the IRI type together with some directive in a reference to indicate the modified processing.
Agree that we should be consistent with case in the naming of directives.
Still favour the adding of 'iri' to IRI directive names because they are quite specialized and apply only to IRIs. They could easily be seen as general purpose.
I will take a stab at completing the documentation of these by the end of the current milestone.
Thinking about this a bit more there are now 6 distinct and mutually exclusive directives that could apply to a reference that is typed as an IRI:
(1) no encoding (2) snake case encoding (3) camel case encoding (4) hash encoding (5) UUID encoding (6) entity IRI
I propose the following:
(1) @A4(IRI mm:rawIRI) -- the default
(2) @A4(IRI mm:camelCaseIRI)
(3) @A4(IRI mm:snakeCaseIRI)
(4) @A4(IRI mm:hashIRI)
(5) @A4(IRI mm:uuidIRI)
(6) @A4(IRI mm:entityIRI) -- resolve to OWL entity; IRI of entity used as is
These are relatively short, there is naming consistency, and the directives clearly apply only to IRIs.
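The "distinct and mutually exclusive" rule, with mm:rawIRI as the default, could be checked roughly as follows. This is only a sketch of the proposal: `select_iri_directive` is a hypothetical helper, and representing a reference's directives as a list of strings is an assumption.

```python
# The six proposed IRI directives, as named in the proposal above.
IRI_DIRECTIVES = (
    "mm:rawIRI",
    "mm:camelCaseIRI",
    "mm:snakeCaseIRI",
    "mm:hashIRI",
    "mm:uuidIRI",
    "mm:entityIRI",
)


def select_iri_directive(directives, default="mm:rawIRI"):
    """Pick the single IRI directive on a reference, or fall back to the default.

    `directives` is the list of directive names written inside the reference;
    directives that are not IRI directives are ignored here.
    """
    chosen = [d for d in directives if d in IRI_DIRECTIVES]
    if len(chosen) > 1:
        raise ValueError("IRI directives are mutually exclusive: " + ", ".join(chosen))
    return chosen[0] if chosen else default
```

For example, `select_iri_directive([])` falls back to mm:rawIRI, while passing both mm:hashIRI and mm:uuidIRI is rejected.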
Martin
I quite like it. I need some clarification though: are you suggesting that we would use these directives also for other entity names, such as classes and individuals, as a replacement of mm:...Encode? For example like this:
Class: @A5(mm:hashIRI)
If yes, I think for OWL entities we may need to use a different default than mm:rawIRI. At the moment I think the default IRI encoding for OWL entities is mm:camelCaseEncode (because that seems to be the best for the average use case), but as soon as we provide the user a way to set defaults for the IRI encodings through the UI (and save those), we plan to use mm:noEncode as the default (which would be equivalent to your mm:rawIRI, I guess) and let the user explicitly override this default with, let's say, mm:camelCaseEncode (see issue #68). So, I guess, ultimately the mm:rawIRI default would work for both IRIs and OWL entities, but it would probably be necessary to let the user specify different default IRI encodings for IRIs (where he would probably keep mm:rawIRI, as it makes more sense) and for OWL entities (where he may want to choose between the several other options). I could even see that the user may want different IRI encoding defaults for T-Box entities (classes, properties) and A-Box entities (individuals).
@csnyulas and I had a longish chat about this today. Some initial conclusions:
We should aim for an MM 2.0 language that has some minor incompatibilities with the previous version of MM. We will document those changes on release.
Perhaps we should start all directives with a lower case letter to be consistent?
(1) We should definitely change mm:IRI to IRI as per discussion above.
(2) Similarly we should change mm:Literal to Literal, as it also indicates a type (of annotation property value).
(3) We should use the directives mm:rawIRI, mm:camelCaseIRI, mm:snakeCaseIRI, mm:hashIRI, and mm:uuidIRI as per above. We should recognize, however, that these directives also apply to OWL entity IRIs - not just IRI-typed references.
(4) Because of (3) we no longer need the rdf:ID directive. This is now implied for an OWL entity-typed reference by any of the 5 IRI directives above.
(5) We should also use the mm:entityIRI directive but recognize that it is only meaningful for IRI-typed references. It is redundant for OWL entity-typed references.
(6) The rdfs:label directive is retained - and makes sense only for OWL entities. It specifies that the resolved reference value is used as the label of the created/resolved entity. If an rdfs:label directive is used in a reference then the resulting IRI is (a) taken from the resolved OWL entity if it exists, or (b) is created as per an IRI directive, with mm:uuidIRI as the default.
(7) We will introduce mm:defaultIRIEncoding to control the default encoding of OWL entity- and IRI-typed references. The default should probably be mm:rawIRI.
(8) We will introduce mm:defaultEntityIRIEncoding that controls IRI encoding for OWL entity-typed references. It overrides mm:defaultIRIEncoding for OWL entity-typed references.
(9) Similarly, we will introduce mm:defaultIndividualIRIEncoding to control IRI encoding for OWL individual-typed references. It overrides mm:defaultEntityIRIEncoding for OWL individual-typed references.
(10) We will introduce mm:defaultAnnotationPropertyValueType to complement the existing mm:defaultPropertyValueType and mm:defaultDataPropertyValueType. The default value will be Literal (see (2)).
(11) We will kill mm:defaultValueEncoding because the rdfs:label and IRI directives render it redundant and confusing.
(12) We may introduce an mm:locationIRI directive to construct IRIs based on the resolved location of the cell in the reference. e.g., an IRI could be http://example.com#MM_A4 for a reference resolved with location A4.
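The override order described in items (7) to (9) can be sketched as a simple fallback chain. This is an illustration of the proposal only: `default_iri_encoding`, the `settings` dictionary, and the exact contents of `ENTITY_TYPES` are assumptions, not existing MM code.

```python
# Illustrative set of OWL entity reference types; the real list may differ.
ENTITY_TYPES = {"Class", "Individual", "ObjectProperty", "DataProperty", "AnnotationProperty"}


def default_iri_encoding(reference_type, settings):
    """Resolve the default IRI encoding for a reference of the given type.

    The most specific setting wins:
    mm:defaultIndividualIRIEncoding (individuals only)
      overrides mm:defaultEntityIRIEncoding (all OWL entities),
      which overrides mm:defaultIRIEncoding (entities and IRI-typed references).
    """
    chain = []
    if reference_type == "Individual":
        chain.append("mm:defaultIndividualIRIEncoding")
    if reference_type in ENTITY_TYPES:
        chain.append("mm:defaultEntityIRIEncoding")
    chain.append("mm:defaultIRIEncoding")
    for name in chain:
        if name in settings:
            return settings[name]
    # Item (7) suggests mm:rawIRI as the probable overall default.
    return "mm:rawIRI"
```

With `{"mm:defaultEntityIRIEncoding": "mm:camelCaseIRI"}` as the only setting, a Class or Individual reference would resolve to mm:camelCaseIRI, while an IRI-typed reference would still fall back to mm:rawIRI.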
Thank you @martinjoconnor for making such an excellent summary of our earlier discussion. I took the liberty of making some minor changes to your comment, in order to keep things precise and tidy:
- I renamed rdfs:Label to rdfs:label throughout the comment (specifically in item 6 and 11), because this is the proper spelling of the RDFS property. I hope this is fine, and it was just a typo (the grammar uses rdfs:label). We need to make sure that our documentation also uses the correct spelling.
- item (12) changed mm:iriLocation to mm:locationIRI
Sounds good. In the next week I will take a stab at cloning the existing DSL page and update to reflect a 2.0 version of MM. We can then switch the pointer to this when we release (and deprecate the current version).
Regarding point (4)
(4) Because of (3) we no longer need the rdf:ID directive. This is now implied for an OWL entity-typed reference by any of the 5 IRI directives above.
I think that we do need to keep the rdf:ID directive. At least to support the use case where the user can specify an arbitrary id schema for her individuals (using for example a certain combination of multiple cell values, that uniquely describe the identity of an individual).
Don't follow. Can you give an example?
Yes. In fact I was planning to point to an email from the mailing list, which I am working on at the moment, where I encountered a situation where I needed to create a meaningful id using information from multiple columns. Will send the reference to the email soon, but here is a quick preview for now:
Individual: @A*
Types: Student
Facts: hasNote @**(rdf:ID=(@A*,"_",@*1,"_",@B*,"_" ,mm:replaceAll(@E*,"[-:]","")))
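As a rough illustration of what the rdf:ID expression above computes, the sketch below concatenates four resolved cell values with "_" separators and strips "-" and ":" from the last one. The function name, its parameter names, and the sample values are hypothetical; reading @A*, @*1, @B* and @E* as "column A of the current row", "row 1 of the current column", and so on is my understanding of the wildcard notation and should be checked against the DSL documentation.

```python
import re


def note_id(a_value, header_value, b_value, e_value):
    """Build an identifier from four resolved cell values.

    Mirrors (@A*, "_", @*1, "_", @B*, "_", mm:replaceAll(@E*, "[-:]", "")):
    the first three values are used as is, and every '-' or ':' is removed
    from the fourth before concatenation.
    """
    cleaned = re.sub(r"[-:]", "", e_value)
    return "".join([a_value, "_", header_value, "_", b_value, "_", cleaned])


# Made-up sample values for one cell of a student/notes spreadsheet.
print(note_id("s001", "Math", "Smith", "2016-03-01T10:30"))
# prints s001_Math_Smith_20160301T1030
```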
Could this not also be the following?
Individual: @A*
Types: Student
Facts: hasNote @**(mm:rawIRI=(@A*,"_",@*1,"_",@B*,"_" ,mm:replaceAll(@E*,"[-:]","")))
Both rdfs:label and mm:rawIRI can be followed by assignment operators if customization is required.
It is also possible that mm:rawIRI and rdf:ID could be treated as synonyms, which would be a nice backwards compatibility thing.
People are still likely to read the ISWC and OWLED papers - which use rdf:ID - and maintaining continuity with the examples in those papers might be a good idea.
Maybe this would work. However this changes a little bit what I understood so far regarding the meaning and use of mm:rawIRI. Until now I thought that:
a) the different mm:...IRI directives do not take "arguments", just like it would not make much sense to have mm:uuidIRI=... (but following your logic, we could maybe imagine something like mm:camelCaseIRI=...), and
b) mm:rawIRI is a replacement for mm:noEncode, as it takes the value of the cell and uses it directly as the IRI generated from the reference.
Again, I should have documented this somewhere, but many - but not all - directives have a default implicit parameter which is the current location. [Clarification: some directives have parameters, some don't; those with parameters have a default initial parameter of the current location.]
So, @A3(rdfs:label) is a shortcut for @A3(rdfs:label=@A3).
Similarly, @A3(mm:printf) is a shortcut for @A3(mm:printf=@A3).
All default parameters can be overridden, e.g., @A3(mm:printf=@B9).
If extra parameters are needed, then the implicit first parameter must be made explicit.
For example, if a user wants to use the mm:printf directive to add a "!" to the resolved value from a cell they would need to explicitly write the first parameter:
@A3(mm:printf(@A3, "!"))
In the case of the IRI directives, it makes no sense for mm:uuidIRI to have parameters, but mm:camelCaseIRI could (where the implication would be, for example, that @A3(mm:camelCaseIRI) is equal to @A3(mm:camelCaseIRI=@A3)).
Some directives do not have parameters. A shift setting directive, for example, would not have an assignment possibility.
It is not always obvious which directives should have default implicit parameters. [Clarification: the real question is whether a directive should have a parameter or not; if it has one, the default implicit parameter convention applies.]
rdf:ID, mm:rawIRI, rdfs:label definitely have default implicit parameters; mm:uuidIRI definitely should not.
We should decide what is appropriate for the other IRI directives.
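The implicit-parameter convention described in this comment could be sketched as follows. The `expand` function and the `TAKES_PARAMETERS` table are hypothetical; the table only records the directives this thread has taken a position on, and the remaining IRI directives are deliberately left out because their status is still open.

```python
# Directives the thread has classified so far: True means the directive takes
# parameters (and so gets the implicit current-location default).
TAKES_PARAMETERS = {
    "rdf:ID": True,
    "mm:rawIRI": True,
    "rdfs:label": True,
    "mm:printf": True,
    "mm:uuidIRI": False,
}


def expand(location, directive, args=None):
    """Expand a directive in a reference at `location` to its explicit form.

    With no arguments, a parameterised directive defaults to the current
    location, so @A3(rdfs:label) becomes rdfs:label=@A3. Explicit arguments
    override the default. Parameterless directives reject arguments.
    """
    if directive not in TAKES_PARAMETERS:
        raise KeyError("undecided or unknown directive: " + directive)
    if not TAKES_PARAMETERS[directive]:
        if args:
            raise ValueError(directive + " does not take parameters")
        return (directive, [])
    return (directive, list(args) if args else ["@" + location])
```

Here `expand("A3", "rdfs:label")` yields the explicit parameter list `["@A3"]`, and `expand("A3", "mm:printf", ["@B9"])` keeps the override `["@B9"]`.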