No support for URNs that don't provide an authority part

Open Silvanoc opened this issue 2 years ago • 7 comments

Describe the bug

No support for URNs that don't provide an authority (the part after :// and before the next /).

Apparently only URIs with an authority section (the part starting with // after the scheme name, like http://, ftp://, tenet://,...) are supported, all others are being interpreted as CURIEs. I've tested it with the URIs provided here: https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Example_URIs
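To illustrate the distinction described above, here is a minimal sketch using only Python's standard library (it does not use any LinkML code). The helper name `has_authority` and the sample URIs are assumptions made for illustration; the point is only that a URI such as a UUID URN is syntactically valid yet has an empty authority component.

```python
from urllib.parse import urlsplit

# Sample URIs: the first two carry an authority, the others do not.
examples = [
    "https://example.org/path",
    "ftp://ftp.example.org/file.txt",
    "urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66",
    "mailto:John.Doe@example.com",
    "tel:+1-816-555-1212",
]


def has_authority(uri: str) -> bool:
    """Return True if the URI has an authority component (the part after '//')."""
    return bool(urlsplit(uri).netloc)


for uri in examples:
    parts = urlsplit(uri)
    print(f"{uri}: scheme={parts.scheme!r}, authority={parts.netloc!r}")
```

All five strings have a scheme, but only the first two have an authority; according to the report, only that kind is currently recognized as a URI.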

To reproduce

I'd like to be able to use UUIDs as identifiers. Unfortunately, I'm running into the following issues when trying to convert to TTL (with linkml-convert):

  1. Although I've declared the corresponding slot with identifier: true and range: string, the conversion fails, reporting Unknown CURIE prefix: @base.
  2. I then tried UUID URNs. Declaring them according to the specification (urn:uuid:<UUID>) reports Unknown CURIE prefix: urn.
  3. Let's try to specify a prefix to overcome it. I cannot declare the prefix urn:uuid, since I get the error Not a valid NCName.
  4. Let's then try to use urn: as a prefix, but without providing any URL or similar: it fails with ValueError: prefix_reference must be supplied.
  5. I then specify urn: "urn:", but it fails with the error ValueError: urn:: is not a valid URI.
  6. urn: urn isn't a solution either: although it doesn't fail, it results in an invalid URN: urnuuid:<UUID>.
  7. Second-to-last try: uuid: "urn:uuid" results in the error ValueError: urn:uuid: is not a valid URI.
  8. Very last try: uuid: "urn:uuid:" results in the error ValueError: urn:uuid:: is not a valid URI.
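The result in step 6 is what you would expect if CURIE expansion is plain concatenation of the prefix's expansion and the part after the first colon. The sketch below is a hypothetical stand-in for that behavior, not LinkML's actual implementation; `expand_curie` and the prefix maps are assumptions made for illustration.

```python
import uuid


def expand_curie(curie: str, prefixes: dict[str, str]) -> str:
    """Naive CURIE expansion: split at the first colon and prepend the
    expansion declared for the prefix to the remaining reference."""
    prefix, _, reference = curie.partition(":")
    if prefix not in prefixes:
        raise ValueError(f"Unknown CURIE prefix: {prefix}")
    return prefixes[prefix] + reference


# The standard library renders a UUID as a URN of the form urn:uuid:<UUID>.
identifier = uuid.UUID("6e8bc430-9c3a-11d9-9669-0800200c9a66").urn

# Prefix map {"urn": "urn"} (step 6): the colon is lost in the expansion.
print(expand_curie(identifier, {"urn": "urn"}))

# Prefix map {"urn": "urn:"} would round-trip in this sketch, but step 5
# reports that this declaration is rejected as "not a valid URI".
print(expand_curie(identifier, {"urn": "urn:"}))
```

In other words, the only prefix declaration that would preserve the URN is the one the prefix validation refuses.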

Expected behavior

I expect that:

  1. URNs are usable directly wherever range uriorcurie has been specified and
  2. TTL resulting from a conversion with linkml-convert leaves the URN untouched, since URNs are supported according to the Turtle Primer.

Additional context

Creating this issue after this Slack conversation: https://obo-communitygroup.slack.com/archives/C04EU7JL1NF/p1682076200991049

Silvanoc avatar Apr 21 '23 14:04 Silvanoc

Hi @Silvanoc, Any news on your issue? I am also interested in using UUID as identifier within LinkML and wanted to get your feedback. Best regards,

cpauvert avatar Jun 14 '23 14:06 cpauvert

AFAIK no news on this issue. I haven't tried it out again since I reported it, but I assume that it hasn't been fixed yet.

I have had neither the time nor the priority to try to provide a patch myself.

Silvanoc avatar Jun 15 '23 06:06 Silvanoc

I stumbled once again upon this issue... The main problem is that LinkML is not using so-called "safe CURIEs", and therefore we have foreseeable ambiguities that are IMO being resolved in the wrong way. I'll elaborate on it:

  1. LinkML claims to support URIs.
  2. The linkml-runtime code checks whether an element is a CURIE in such a way that valid URIs without the optional authority component are treated as CURIEs and therefore fail validation.

This is a discussion where the maintainers' opinion is key, so I'm involving @cmungall in it.

IMO LinkML should decide on one of these options (ordered by my preference, most preferred first):

  1. Modify the current URI/CURIE validation code so that it disambiguates in this order:
    • a value is first checked against the URL format (the preferred and expected format),
    • if it is not a valid URL, it is checked as a CURIE (we limit supported CURIEs to those with a prefix listed in the schema),
    • otherwise it is checked as a URI,
    • otherwise validation fails.
  2. Migrate to "Safe-CURIEs", adapting the documentation and code accordingly. The latest CURIE specification defines the datatype URIorSafeCURIE, but no URIorCURIE, which speaks for this approach if we want to be correct.
  3. Stop claiming support for URIs and change the documentation to claim support for URLs and CURIEs (which would be better called CURLEs then). In that case the possibility of using invalid URLs/CURIEs like nonsense://invalid should be disabled (as of now it's considered valid). This fully excludes URNs.

Supporting option 2 should be pretty straightforward and I might provide a patch for it, since I'm interested in getting this issue fixed while maintaining URI support. Option 1 is tougher, but I could also try to provide a patch for that one. I'm against option 3.
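The check order proposed in option 1 could look roughly like the sketch below. This is only an illustration under stated assumptions: `classify` and `declared_prefixes` are hypothetical names, the URL test is reduced to "a scheme followed by //", and none of this reflects the existing linkml-runtime code.

```python
import re

# RFC 3986 scheme syntax: a letter followed by letters, digits, '+', '-' or '.'.
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*$")


def classify(value: str, declared_prefixes: set[str]) -> str:
    """Classify a value as 'url', 'curie' or 'uri', in that order of precedence."""
    head, sep, rest = value.partition(":")
    if not sep or not rest:
        raise ValueError(f"{value} is neither a URL, a CURIE nor a URI")
    # 1. URL: a scheme followed by an authority section.
    if SCHEME_RE.match(head) and rest.startswith("//"):
        return "url"
    # 2. CURIE: only prefixes declared in the schema are accepted.
    if head in declared_prefixes:
        return "curie"
    # 3. Any other syntactically plausible URI, e.g. a URN.
    if SCHEME_RE.match(head):
        return "uri"
    # 4. Nothing matched: validation fails.
    raise ValueError(f"{value} is neither a URL, a CURIE nor a URI")


print(classify("https://example.org/x", {"iot"}))
print(classify("urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66", {"iot"}))
print(classify("iot:thing", {"iot"}))
print(classify("iot:thing", set()))
```

With this order a UUID URN passes as a URI as long as urn is not declared as a prefix, while a value such as iot:thing is read as a CURIE or as a URI depending on whether the schema declares iot.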

Silvanoc avatar Sep 12 '23 09:09 Silvanoc

I think we should support 1. However, for CURIE validation I don't think the complete validation should be done by the CURIE class, as this would require out of band knowledge about what prefixes are being used. Instead, CURIE.py should be inclusive, and valid and complete validation should be performed by a validator that is provided with sufficient context.

A guiding principle here should be consistency with JSON-LD. Ideally a JSON-serialized LinkML object should map to the correct RDF when provided with a context derived from the schema. I think that may contraindicate 2, but need to check.

Needless to say we are not in favor of 3, and apologies on the tardiness of this!

cmungall avatar Sep 13 '23 15:09 cmungall

2 would be the most robust way to resolve it (that's why it's the recommended disambiguation solution in the spec). But I assume that such a breaking change (it would interpret any CURIEs in old schemas as URIs) is not desired, that's why I haven't taken it as my favourite.
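A rough sketch of why option 2 is unambiguous and also why it breaks existing schemas: a safe CURIE is wrapped in square brackets, and a URI can never start with [ because it must begin with a scheme, which starts with a letter. The function name `parse_uri_or_safe_curie` is an assumption made for illustration, not part of LinkML.

```python
def parse_uri_or_safe_curie(value: str) -> tuple[str, str]:
    """Return ('curie', 'prefix:reference') for a bracketed safe CURIE,
    otherwise ('uri', value). No prefix map is needed to decide."""
    if len(value) > 2 and value[0] == "[" and value[-1] == "]":
        return ("curie", value[1:-1])
    return ("uri", value)


# A bracketed value is always a CURIE.
print(parse_uri_or_safe_curie("[iot:thing]"))

# An unbracketed value is always a URI, so URNs work unchanged...
print(parse_uri_or_safe_curie("urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66"))

# ...but an unbracketed CURIE from an old schema would now be read as a URI.
print(parse_uri_or_safe_curie("iot:thing"))
```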

1 is a compromise, but in order to work LinkML would need to define some rules on how to differentiate CURIEs from URIs without an authority component. It's important to note that any backwards-compatible rule supporting the "unsafe" CURIE format will necessarily "shadow"[^1] correct URIs (as of now it is "shadowing" all URIs without an authority component, which is a lot). If it goes for this approach, IMO LinkML should try to minimize the "shadowed" URIs.

I'm glad to read that you're not in favor of 3 🙂


WRT Option 1

I've also noticed that the UriOrCurie validation lacks context information about the available prefixes. However, the function that tells whether something is a CURIE (is_curie) accepts an argument that looks like it could be the way to pass exactly that context: nsm. Perhaps @hsolbrig, as the author of the code, can clarify it.

[^1]: With URI "shadowing" I mean that something like iot:thing can be at the same time a valid CURIE and a valid URI. If LinkML tooling considers it a CURIE because the prefix iot has been declared, then it will be shadowing the valid URI iot:thing.

Silvanoc avatar Sep 13 '23 15:09 Silvanoc

See also #258

cmungall avatar Sep 14 '23 01:09 cmungall

@cmungall IMO we can move the conversation to #258 and close this issue. I'm not closing it since you already labeled it and also added it to the milestone.

Silvanoc avatar Sep 14 '23 07:09 Silvanoc

Can this be closed? What needs to be done to "move the conversation to #258" - did you want the comments to be copied over, or just have further conversation there?

nlharris avatar Nov 08 '24 22:11 nlharris

> Can this be closed? What needs to be done to "move the conversation to #258" - did you want the comments to be copied over, or just have further conversation there?

I just wanted to have further conversation there. That is what I meant with "move". So yes, it can be closed.

Silvanoc avatar Nov 09 '24 07:11 Silvanoc