Update namespace in urn
Related to #1806, the URN-double colon is updated to a single colon in order to be fully compliant to the urn-scheme
Codecov Report
Merging #1808 (3d1797f) into main (c69221b) will increase coverage by
0.06%. The diff coverage isn/a.
@@ Coverage Diff @@
## main #1808 +/- ##
==========================================
+ Coverage 83.55% 83.62% +0.06%
==========================================
Files 44 44
Lines 8102 8102
Branches 2218 2218
==========================================
+ Hits 6770 6775 +5
+ Misses 851 847 -4
+ Partials 481 480 -1
| Impacted Files | Coverage Δ | |
|---|---|---|
| cwltool/provenance.py | 81.58% <ø> (ø) |
|
| cwltool/job.py | 81.88% <0.00%> (+0.39%) |
:arrow_up: |
| cwltool/workflow_job.py | 87.42% <0.00%> (+0.56%) |
:arrow_up: |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
If we're using the urn:hash scheme then we should follow its recommendations, which in this case has a :: as the media type is unknown. Unfortunately his is stuck in 'draft' stage (now expired).
It is as far as I can tell however compatible with the older RFC2141 you linked to.
<URN> ::= "urn:" <NID> ":" <NSS>
<NID> ::= <let-num> [ 1,31<let-num-hyp> ]
<NSS> ::= 1*<URN chars>
Substituting the example urn:hash::sha1:LBPI666ED2QSWVD3VSO5BG5R54TE22QL we get:
NID = "hash"
NSS = ":sha1:LBPI666ED2QSWVD3VSO5BG5R54TE22QL"
Now the question is if ":" is permitted in <URN chars>:
<URN chars> ::= <trans> | "%" <hex> <hex>
<trans> ::= <upper> | <lower> | <number> | <other> | <reserved>
<other> ::= "(" | ")" | "+" | "," | "-" | "." |
":" | "=" | "@" | ";" | "$" |
"_" | "!" | "*" | "'"
And there it is in second line of <other>.
Now in newer RFC8141 the rules are different:
assigned-name = "urn" ":" NID ":" NSS
NID = (alphanum) 0*30(ldh) (alphanum)
ldh = alphanum / "-"
NSS = pchar *(pchar / "/")
We assign NID and NSS the same, and hash is still valid NID. Now the question is if the NSS can start with : from pchar?
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
So ":" is declared directly, so it should also be valid.
Therefore I don't see the need for this pull request as it is breaking the namespace which was already valid before, if a bit odd-looking (but then so are all the URNs! :-) )
The real fix should be to get rid of insecure SHA1 all together and use sha256 or better, which then can use the newer Naming things with hashes scheme (RFC 6920), which deliberately do not support SHA1 and has a registry of modern checksum algorithms.
Reading the document you linked in the issue linked here, I would expect
:sha1, so no idea why we have the::sha1, nor why this document above also uses the same two colons. Strange.
You are thinking of the separate urn:sha1 scheme which is not well defined.
@MBueschelberger Can you help us by sharing your motivation for this change?
Hi @mr-c,
I actually requested this change since I was not able to parse the graph with any of the available RDFLib-versions. Removing the second colon actually solved the problem for me.
This is why I actually initiated this discussion here about the origin of this ::.
Best regards
@simleo Do you have experience using rdflib to parse CWL prov aggregates that contain urn:hash::sha1 (note the double colon)?
https://github.com/common-workflow-language/cwltool/pull/1808/files#diff-2bb87b99f4b0d10faa69f7bcec12f404bc4a41ac1aae63f8a14d6460fa5798a5L520
@mr-c no, sorry. In runcrate, I use cwlprov-py to read CWLProv ROs, but I don't use rdflib directly.