purl-spec icon indicating copy to clipboard operation
purl-spec copied to clipboard

qualifiers sorting

Open jdillon opened this issue 5 years ago • 6 comments

The spec reads

If the qualifiers are not empty and not composed only of key/value pairs where the value is empty:
...
 * sort this list of qualifier strings lexicographically
...

Is the "qualifier strings" the entire key/value pair or just the key?

More so though why is it important to sort this at all. This implies and ordering; which here in this generic aspect of a purl for qualifiers is erroneous. I can't think of any case where the qualifier order would be important, and if it was important the importance would be based on the initial order given and the implementation of that order highly package-type specific.

It seems to me that this aspect of the spec should be removed; and either replaced by that qualifiers are non-ordered, or that they are ordered and order should be retained from original input purl. I'd suggest though that order here isn't mean to be important so the directive to sort make this a bit problematic and implies that the qualifiers are indeed meant to be ordered.

Since this aspect of a purl is so very format dependent, purl imposing this rule on how to render is dangerous as you could have formats that actually do need ordering here, and others that don't care. I'd suggest formats shouldn't care about order here but the fact that the spec requires order implies its an ordered structure and I think that is generally bad, and at the very least misleading.

jdillon avatar Jan 12 '19 03:01 jdillon

ftr my recommendation for order of qualifiers is that its non-ordered; but a recommendation for normalizing a purl to canonical is to retain input order.

so if you are given:

pkg:FOO/bar/baz@qux?zee=boo;aaa=bar

then the canonical representation is normalized as:

pkg:foo/bar/baz@qux?zee=boo&aaa=bar

... the current spec implies that this should instead be normalized as:

pkg:foo/bar/baz@qux?aaa=bar&zee=boo

This would break if the "foo" type actually had the order of qualifiers as significant. While I would hope no format would do that, I can't predict what folks will invent and why. But at the same time I don't see any reason or value for the specification to enforce order on the qualifiers mapping; except to suggest if not mandate that the order is retained when doing translation and/or normalization.

I think its important for these more flexible aspects of the purl spec to reduce arbitrary mutating of the data w/o reason; as we really can't easily guess how purl may be used in the future; and the spec needs to be a generic as possible while still providing some restrictions on how the identifiers are constructed, parsed and normalized to be useful for interchange.

That said I don't see any value in lexicographically sorting qualifiers.

jdillon avatar Jan 12 '19 03:01 jdillon

@jdillon the qualifiers are un-ordered: their order is not significant (such that aaa=bar&zee=boo and zee=boo&aaa=bar are the same exact qualifiers). And their canonical representation is sorted such that two package URL strings that are the same except for the ordering of their qualifiers are the exact same canonical string.

The rationale for this is that when you compare two canonicalized Package URL strings, the URLs are the same if the canonicalized strings are the same. The main benefit is --for instance-- when a Package URL string is used as a database field or database key or in some UI display: in these cases using (and eventually storing) a canonical version of the Package URL makes sense and avoid treating the same Package URLs as different things if they were not canonicalized.

There is no obligation to use this canonical representation anywhere of course, including for equality tests. You could instead always parse and compare each element separately and in the actual data structure used to implement qualifiers may be or may not be sorted or ordered as long as you can check that they are the same when you ignore their order. Typically, if you use some kind of unordered mapping (typically hash table-based such as a Map, hash, dict, object, etc. depending how this is called in a programming language) to store qualifiers then you can compare qualifiers of two package URLs directly without having to use a canonical representation.

Is the "qualifier strings" the entire key/value pair or just the key?

This should be the entire key/value pair. We need to clarify the spec alright for this.

pombredanne avatar Jan 18 '19 14:01 pombredanne

Okay fair enough.


Is the "qualifier strings" the entire key/value pair or just the key?

This should be the entire key/value pair. We need to clarify the spec alright for this.

any reason why it would be the entire key/value? since key must be unique within the keys of the qualifiers string, sorting the key/value seems only relevant if you could have duplicate keys? since duplicates are not allowed, isn't sorted key enough to ensure normality?

jdillon avatar Jan 18 '19 19:01 jdillon

So this seems like a very specific use-case where rendered purl strings would be used as exact match strings.

I can see how that may be useful, though as a spec requirement I think its fairly pretentious and unneeded. If a specific application wants to use a package-url as a string equivalent constant to use as a key, then I would suggest it implement a normalizable algorithm specific to the application.

One specific reason for this is that you might imagine a package-url qualifier scheme where order is actually important. So for this reason I suggest order should just be ignored spec-wise.

Applications that require order can easily implement what they need here and avoid spec complication.

jdillon avatar Apr 16 '19 01:04 jdillon

@jdillon sorry for the late reply:

any reason why it would be the entire key/value? since key must be unique within the keys of the qualifiers string, sorting the key/value seems only relevant if you could have duplicate keys? since duplicates are not allowed, isn't sorted key enough to ensure normality?

Yes it would be. Do you mind a PR to clarify this?

pombredanne avatar Nov 26 '19 17:11 pombredanne

I can see how that may be useful, though as a spec requirement I think its fairly pretentious and unneeded. If a specific application wants to use a package-url as a string equivalent constant to use as a key, then I would suggest it implement a normalizable algorithm specific to the application.

This is to ensure that we have a normalized string representation that can be safely used for equality. You do not have to use it though, so I am not sure that is much of an issue.

One specific reason for this is that you might imagine a package-url qualifier scheme where order is actually important. So for this reason I suggest order should just be ignored spec-wise.

Qualifiers ordering is undefined, but the fact that sorting is used in a canonical forms means that this is not significant. Also for most practical implementations I can think of, qualifiers would be implemented as some kind of mapping or hash table, most of which are unordered. Finally things would likely be made brittle IMHO if we had a purl where the order of qualifiers would matter. Is there an actual problem that you have faced somehow?

pombredanne avatar Nov 26 '19 22:11 pombredanne