purl-spec
Clarifications
I have been going over the specification and I have a few minor things I was hoping to get clarification on.
The checksum qualifier doesn't have a formal restriction on the algorithm name, I'm assuming it should be one ASCII letter followed by any number of ASCII alphanumerics (possibly with the addition of a hyphen, though that seems like it could conflict with subresource integrity's use of "-" as a delimiter). Also, should the canonical form of checksum list be deduped or sorted?
When parsing, should "strip leading and trailing /" include runs of slashes or just a single slash? It seems minor, but with the introduction of the "pkg:" scheme, it means that the type could lead with a run. I wasn't sure if multiple slash removal was always necessary or not.
With character encoding, the Wikipedia article cites RFC 3986 in reference to the reserved and unreserved characters. I'm assuming that all reserved characters should be encoded (unless explicitly used as a purl delimiter) and none of the unreserved characters should be encoded. That is, when a reserved character has no special meaning to purl, it should still be encoded (e.g. the Maven GAV `o'doyle:rules!:1.0` should be `pkg:maven/o%27doyle/rules%21@1.0` and not `pkg:maven/o'doyle/rules!@1.0`). Also, would referencing RFC 3986 directly make more sense than the Wikipedia article? It seems like one is less of a moving target than the other.
Also, because it always comes up eventually, how to handle space:
pkg:example/foo%20bar@1%20GA?foo=bar+gus
pkg:example/foo%20bar@1%20GA?foo=bar%20gus
@jgustie Thanks! Let me comment in a few days (travelling ATM)
The checksum qualifier doesn't have a formal restriction on the algorithm name, I'm assuming it should be one ASCII letter followed by any number of ASCII alphanumerics (possibly with the addition of a hyphen, though that seems like it could conflict with subresource integrity's use of "-" as a delimiter).
Good point. We could specify this as suggested: one ASCII letter followed by any number of ASCII alphanumerics, including `-` and `_`. Since the actual checksum value is then separated by a colon and is specified as HEX, there would not be any parsing conflicts.
Also, should the canonical form of checksum list be deduped or sorted?
Excellent point too: it does make sense to have them normalized to all lowercase, and the list deduped and sorted lexicographically for a canonical form.
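To make the proposed canonical form concrete, here is a minimal Python sketch. It assumes the checksum qualifier is a comma-separated list of `algorithm:hex` entries, and it uses the stricter algorithm-name rule from the original question (one ASCII letter followed by ASCII alphanumerics); whether `-` and `_` are allowed is still open in this thread, so treat the pattern as an assumption. The name `canonical_checksums` is hypothetical and not part of any purl library.

```python
import re

# Assumption: one ASCII letter followed by ASCII alphanumerics (no "-" or "_").
ALGORITHM = re.compile(r"[a-z][a-z0-9]*")
# Assumption: the checksum value is a non-empty run of hex digits.
HEX = re.compile(r"[0-9a-f]+")


def canonical_checksums(value: str) -> str:
    """Lowercase, validate, dedupe, and sort a comma-separated checksum list."""
    seen = set()
    for item in value.lower().split(","):
        algorithm, sep, digest = item.partition(":")
        if not sep or not ALGORITHM.fullmatch(algorithm) or not HEX.fullmatch(digest):
            raise ValueError(f"invalid checksum entry: {item!r}")
        seen.add(f"{algorithm}:{digest}")
    # Plain string sort gives a lexicographic order over the lowercase entries.
    return ",".join(sorted(seen))


print(canonical_checksums("SHA256:AB12,sha1:ff00,sha256:ab12"))
```

With this rule, differently cased duplicates collapse into one entry, and two lists with the same entries in a different order produce the same string.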
When parsing, should "strip leading and trailing /" include runs of slashes or just a single slash? It seems minor, but with the introduction of the "pkg:" scheme, it means that the type could lead with a run. I wasn't sure if multiple slash removal was always necessary or not.
This needs to be refined for clarity, but I consider this as removing any and all leading and trailing slashes in this case. There should not be any leading or trailing slashes left after this.
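As a small illustration of the "remove every leading and trailing slash" reading, here is a Python sketch. The helper name `strip_slashes` is hypothetical; it only shows that runs of slashes at either end are removed while interior slashes are kept.

```python
def strip_slashes(text: str) -> str:
    """Remove all leading and trailing "/" characters, including runs."""
    # str.strip with an argument removes every matching character from both
    # ends, so "//a/b//" and "/a/b/" give the same result.
    return text.strip("/")


# Remainder of a purl after the "pkg:" scheme has been split off.
print(strip_slashes("//maven/org.example/lib//"))
```

Under this reading, `pkg:maven/...`, `pkg:/maven/...` and `pkg://maven/...` all parse to the same type.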
With character encoding, the Wikipedia article cites RFC 3986 in reference to the reserved and unreserved characters. I'm assuming that all reserved characters should be encoded (unless explicitly used as a purl delimiter) and none of the unreserved characters should be encoded. That is, when a reserved character has no special meaning to purl, it should still be encoded (e.g. the Maven GAV `o'doyle:rules!:1.0` should be `pkg:maven/o%27doyle/rules%21@1.0` and not `pkg:maven/o'doyle/rules!@1.0`).
Also, would referencing RFC 3986 directly make more sense than the Wikipedia article? It seems like one is less of a moving target than the other.
Yes and yes. Your interpretation, reading, and suggestion are correct and to the point. We should reference RFC 3986 alright, and indeed `pkg:maven/o%27doyle/rules%21@1.0` would be the right way in the case you mentioned.
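For what it's worth, the interpretation agreed on here (encode everything except the RFC 3986 unreserved characters, and let purl add its own delimiters) can be sketched in Python with `urllib.parse.quote`. The helper names `encode_component` and `maven_purl` are hypothetical, and the sketch ignores namespaces with multiple segments, qualifiers, and subpaths.

```python
from urllib.parse import quote


def encode_component(text: str) -> str:
    """Percent-encode everything except RFC 3986 unreserved characters."""
    # quote() never encodes ASCII letters, digits, or "_.-~"; safe="" stops it
    # from also leaving "/" unencoded, which is its default.
    return quote(text, safe="")


def maven_purl(group: str, artifact: str, version: str) -> str:
    """Build a purl from a Maven GAV; the delimiters are added unencoded."""
    return (
        f"pkg:maven/{encode_component(group)}"
        f"/{encode_component(artifact)}@{encode_component(version)}"
    )


print(maven_purl("o'doyle", "rules!", "1.0"))
```

The point of the design is that each component is encoded on its own before the `/` and `@` delimiters are inserted, so a reserved character inside a component can never be mistaken for a delimiter.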
@jgustie Would you care to submit a PR, or shall I add these refinements and clarifications myself?
I'll defer to avoid any contributor issues and to keep the language/voice consistent ;)
I would suggest leaving "-" out of the algorithm production though: like I said there are specifications where "-" is the delimiter instead of ":" and by proactively disallowing it you make it easier for algorithm names to be used interchangeably.
Also what about the qualifier value encoding of spaces (i.e. `+` or `%20`)? I'm assuming everywhere else `%20` is appropriate.
@jgustie ok, I will take care to update the spec alright.
Fair enough to exclude the `-` from the algo. And likely `_` too, right?
As for encoding of spaces, I am inclined to use `%20` everywhere. Does it make sense?
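One practical argument for `%20` everywhere is that a plain percent-decoder does not treat `+` as a space. The Python sketch below illustrates this with the standard `urllib.parse` functions; the helper name `encode_value` is hypothetical.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus


def encode_value(text: str) -> str:
    """Percent-encode text so that a space always becomes %20."""
    return quote(text, safe="")


value = "bar gus"
percent_form = encode_value(value)  # space encoded as %20
plus_form = quote_plus(value)       # form-style encoding, space encoded as +

# unquote() leaves "+" untouched, so the "+" form only round-trips when the
# reader also knows to apply form-style decoding with unquote_plus().
print(percent_form, unquote(percent_form) == value)
print(plus_form, unquote(plus_form) == value, unquote_plus(plus_form) == value)
```

With `%20` in every component, one decoding rule covers the name, version, and qualifier values alike.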
Apologies for jumping on an inactive thread, but I also had some points I was hoping to get clarity on:
- The specification mentions "lowercase" transformation. Can we clarify that this is a locale-independent case-folding operation? For example, Turkish has a locale-specific mapping from uppercase I (U+0049 LATIN CAPITAL LETTER I) to ı (U+0131 LATIN SMALL LETTER DOTLESS I), so performing a localized lowercase of the string `INFO` produces `ınfo`, not `info` as one might expect.
- In the section "How to parse a purl string in its components" the specification mentions "left" and "right". For non-LTR scripts or strings with bidirectional markers, these terms can be ambiguous. It would be more correct to describe these as "start" and "end" of the string.
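One way to make the lowercase rule unambiguous is to define it over ASCII letters only. The Python sketch below shows that reading; it is an assumption about what the spec could say, not what it currently says, and the name `ascii_lower` is hypothetical. Python's own `str.lower()` does not consult the process locale, but other runtimes (for example Java's `String.toLowerCase()` without an explicit locale) do, which is where the Turkish case bites.

```python
import string

# Translation table that maps only A-Z to a-z; every other code point,
# including non-ASCII letters, is left unchanged.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(text: str) -> str:
    """Lowercase ASCII letters only, independent of any locale setting."""
    return text.translate(_ASCII_LOWER)


print(ascii_lower("INFO"))
print(ascii_lower("\u0130NFO"))  # U+0130 (capital I with dot above) is kept as-is
```

An explicit table like this gives the same result on every platform, at the cost of leaving non-ASCII uppercase letters untouched.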