purl-spec

Clarifications

Open jgustie opened this issue 6 years ago • 9 comments

I have been going over the specification and I have a few minor things I was hoping to get clarification on.

The checksum qualifier doesn't have a formal restriction on the algorithm name. I'm assuming it should be one ASCII letter followed by any number of ASCII alphanumerics (possibly with the addition of a hyphen, though that seems like it could conflict with Subresource Integrity's use of "-" as a delimiter). Also, should the canonical form of the checksum list be deduped or sorted?

When parsing, should "strip leading and trailing /" include runs of slashes or just a single slash? It seems minor, but with the introduction of the "pkg:" scheme, it means that the type could lead with a run. I wasn't sure if multiple slash removal was always necessary or not.

With character encoding, the Wikipedia article cites RFC 3986 in reference to the reserved and unreserved characters. I'm assuming that all reserved characters should be encoded (unless explicitly used as a purl delimiter) and none of the unreserved characters should be encoded. That is, when a reserved character has no special meaning to purl, it should still be encoded (e.g. the Maven GAV o'doyle:rules!:1.0 should be pkg:maven/o%27doyle/rules%21@1.0 and not pkg:maven/o'doyle/rules!@1.0). Also, would referencing RFC 3986 directly make more sense than the Wikipedia article? It seems like one is less of a moving target than the other.

jgustie avatar Mar 23 '18 14:03 jgustie

Also, because it always comes up eventually: how should spaces be handled? Is it pkg:example/foo%20bar@1%20GA?foo=bar+gus or pkg:example/foo%20bar@1%20GA?foo=bar%20gus

jgustie avatar Mar 23 '18 14:03 jgustie

@jgustie Thanks! Let me comment in a few days (travelling ATM)

pombredanne avatar Mar 28 '18 20:03 pombredanne

The checksum qualifier doesn't have a formal restriction on the algorithm name. I'm assuming it should be one ASCII letter followed by any number of ASCII alphanumerics (possibly with the addition of a hyphen, though that seems like it could conflict with Subresource Integrity's use of "-" as a delimiter).

Good point. We could specify this, as suggested, as one ASCII letter followed by any number of ASCII alphanumerics, including "-" and "_". Since the actual checksum value is then separated by a colon and is specified as hex, there would not be any parsing conflicts.

Also, should the canonical form of the checksum list be deduped or sorted?

Excellent point too: it does make sense to have them normalized to all lowercase, and the list deduped and sorted lexicographically for a canonical form.
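As a rough illustration of that canonical form, here is a minimal Python sketch. It assumes the checksum value is a comma-separated list of algorithm:hex entries, and it uses the stricter "one ASCII letter followed by ASCII alphanumerics" pattern from the original question. The function name canonical_checksums is hypothetical and not part of the spec.

```python
import re

# Assumed grammar (not yet in the spec): one ASCII letter, then ASCII
# alphanumerics, a colon, then a hex-encoded digest.
_CHECKSUM = re.compile(r"[A-Za-z][A-Za-z0-9]*:[0-9A-Fa-f]+")


def canonical_checksums(value: str) -> str:
    """Lowercase, dedupe, and lexicographically sort a checksum list."""
    items = set()
    for raw in value.split(","):
        item = raw.strip()
        if not _CHECKSUM.fullmatch(item):
            raise ValueError(f"invalid checksum entry: {raw!r}")
        # Lowercasing before adding to the set makes dedup case-insensitive.
        items.add(item.lower())
    return ",".join(sorted(items))


print(canonical_checksums("SHA256:AB12,sha1:ff00,sha256:ab12"))
```

Lowercasing before deduplication means entries that differ only in case collapse into one.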

pombredanne avatar Apr 02 '18 14:04 pombredanne

When parsing, should "strip leading and trailing /" include runs of slashes or just a single slash? It seems minor, but with the introduction of the "pkg:" scheme, it means that the type could lead with a run. I wasn't sure if multiple slash removal was always necessary or not.

This needs to be refined for clarity, but I consider this as removing any and all leading and trailing slashes in this case. There should not be any leading or trailing slashes left after this.
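A minimal Python sketch of that reading, assuming the "pkg:" scheme has already been split off. The helper name strip_slashes is hypothetical.

```python
def strip_slashes(remainder: str) -> str:
    """Remove every leading and trailing '/' while keeping interior ones."""
    # str.strip with an argument removes whole runs of that character from
    # both ends, so "///maven/foo//" and "/maven/foo/" give the same result.
    return remainder.strip("/")


print(strip_slashes("///maven/foo//"))
```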

pombredanne avatar Apr 02 '18 14:04 pombredanne

With character encoding, the Wikipedia article cites RFC 3986 in reference to the reserved and unreserved characters. I'm assuming that all reserved characters should be encoded (unless explicitly used as a purl delimiter) and none of the unreserved characters should be encoded. That is, when a reserved character has no special meaning to purl, it should still be encoded (e.g. the Maven GAV o'doyle:rules!:1.0 should be pkg:maven/o%27doyle/rules%21@1.0 and not pkg:maven/o'doyle/rules!@1.0).

Also, would referencing RFC 3986 directly make more sense than the Wikipedia article? It seems like one is less of a moving target than the other.

Yes and yes. Your interpretation, reading, and suggestion are correct and to the point. We should reference RFC 3986 indeed, and pkg:maven/o%27doyle/rules%21@1.0 would be the right way in the case you mentioned.
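For what it's worth, the interpretation agreed on here (encode everything except RFC 3986 unreserved characters) can be sketched in Python with the standard library. The helper names encode_component and maven_purl are hypothetical, and this reflects only the reading in this thread, not the final spec text.

```python
from urllib.parse import quote


def encode_component(value: str) -> str:
    """Percent-encode everything except RFC 3986 unreserved characters."""
    # safe="" stops quote() from passing "/" through unencoded; letters,
    # digits, and "-._~" are never encoded (Python 3.7+ for "~").
    return quote(value, safe="")


def maven_purl(group: str, artifact: str, version: str) -> str:
    """Assemble a purl from a Maven GAV, adding the purl delimiters last."""
    return "pkg:maven/{}/{}@{}".format(
        encode_component(group),
        encode_component(artifact),
        encode_component(version),
    )


print(maven_purl("o'doyle", "rules!", "1.0"))
```

Each component is encoded on its own before the "/" and "@" delimiters are added, so delimiters are never encoded by accident.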

pombredanne avatar Apr 02 '18 15:04 pombredanne

@jgustie Would you care to submit a PR? Or shall I add these refinements and clarifications myself?

pombredanne avatar Apr 02 '18 15:04 pombredanne

I'll defer to avoid any contributor issues and to keep the language/voice consistent ;)

I would suggest leaving "-" out of the algorithm production though: like I said there are specifications where "-" is the delimiter instead of ":" and by proactively disallowing it you make it easier for algorithm names to be used interchangeably.

Also what about the qualifier value encoding of spaces (i.e. + or %20)? I'm assuming everywhere else %20 is appropriate.

jgustie avatar Apr 02 '18 16:04 jgustie

@jgustie ok, I will take care to update the spec alright.

Fair enough to exclude the - from the algo. And likely _ too, right?

As for encoding of spaces, I am inclined to use %20 everywhere. Does it make sense?
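To make the two options concrete, here is a small Python sketch contrasting them with standard library functions. The helper name encode_qualifier_value is hypothetical.

```python
from urllib.parse import quote, quote_plus, unquote


def encode_qualifier_value(value: str) -> str:
    """Encode a qualifier value using %20 for spaces, never '+'."""
    return quote(value, safe="")


# "%20 everywhere" style
print(encode_qualifier_value("bar gus"))
# form-encoding style, shown only for contrast
print(quote_plus("bar gus"))
# A plain percent-decoder leaves "+" alone, so a literal plus survives.
print(unquote("bar+gus"))
```

With %20 everywhere, a decoder never has to guess whether "+" means a space or a literal plus sign.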

pombredanne avatar Apr 02 '18 17:04 pombredanne

Apologies for jumping on an inactive thread, but I also had some points I was hoping to get clarity on:

  • The specification mentions "lowercase" transformation. Can we clarify that this is a locale-independent case-folding operation? For example, Turkish has a locale-specific mapping from uppercase I (U+0049 LATIN CAPITAL LETTER I) to ı (U+0131 LATIN SMALL LETTER DOTLESS I), so performing a localized lowercase of the string INFO produces ınfo not info, as one might expect.
  • In the section "How to parse a purl string in its components" the specification mentions "left" and "right". For non-LTR scripts or strings with bidirectional markers, these terms can be ambiguous. It would be more correct to describe these as "start" and "end" of the string.
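Both points can be sketched in Python. The helper names ascii_lower and split_subpath are hypothetical. Restricting lowercasing to ASCII letters is an assumption, since the thread only asks for locale independence. The sketch also assumes the subpath is split off at the last "#".

```python
import string

# Translation table covering only A-Z, so the result never depends on the
# process locale and non-ASCII characters are left untouched.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(value: str) -> str:
    """Lowercase ASCII letters only, independent of any locale."""
    return value.translate(_ASCII_LOWER)


def split_subpath(purl: str) -> tuple:
    """Split at the last '#', searching from the end of the string."""
    # rpartition works on logical (storage) order, so "end" is unambiguous
    # regardless of how a bidirectional string is displayed.
    remainder, sep, subpath = purl.rpartition("#")
    if not sep:
        return purl, ""
    return remainder, subpath


print(ascii_lower("INFO"))
print(split_subpath("pkg:example/foo#a/b"))
```

Python's built-in str.lower() does not consult the process locale either, but the explicit table makes the ASCII-only intent visible.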

mattt avatar Jul 21 '21 17:07 mattt