purl-spec icon indicating copy to clipboard operation
purl-spec copied to clipboard

Revert "Fix missing percent-encoding in purl example with repository_url"

Open maxhbr opened this issue 1 year ago • 11 comments

This reverts commit 37626c43ace346f38f390f9266ef99eef654bf66.

Escaping slashes in qualifiers necessary and does not match the test cases nor the spec: https://github.com/package-url/purl-spec/blob/37626c43ace346f38f390f9266ef99eef654bf66/test-suite-data.json#L110-L121

see also: https://github.com/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rst#character-encoding

maxhbr avatar Jul 12 '22 11:07 maxhbr

Escaping slashes in qualifiers necessary and does not match the test cases nor the spec:

IMO it does match the spec, which says for qualifiers: "A value must be a percent-encoded string".

And it also makes sense that way, I believe, so I'd argue that the test cases are wrong.

Can you clarify @pombredanne?

sschuberth avatar Jul 12 '22 12:07 sschuberth

The example pkg:generic/[email protected]?download_url=https://openssl.org/source/openssl-1.1.0g.tar.gz&checksum=sha256:de4d501267da would then be rather unreadable:

pkg:generic/[email protected]?download_url=https%3A%2F%2Fopenssl.org%2Fsource%2Fopenssl-1.1.0g.tar.gz%26checksum%3Dsha256%3Ade4d501267da

maxhbr avatar Jul 12 '22 12:07 maxhbr

The example pkg:generic/[email protected]?download_url=https://openssl.org/source/openssl-1.1.0g.tar.gz&checksum=sha256:de4d501267da would then be rather unreadable:

I don't think that readability is a concern here, but parsability.

sschuberth avatar Jul 12 '22 17:07 sschuberth

Please check my comment in https://github.com/package-url/purl-spec/pull/168#discussion_r881693906 , going against URI would be an questionable design choice ;-)

I was not aware of the test-suite, i guess it should be corrected then to cover this case correctly.

jsteinhofff avatar Jul 12 '22 19:07 jsteinhofff

in https://datatracker.ietf.org/doc/html/rfc3986#section-3.4 I read:

However, as query components are often used to carry identifying information in the form of "key=value" pairs and one frequently used value is a reference to another URI, it is sometimes better for usability to avoid percent-encoding those characters.

which is pretty clear and / and ? are also added to the list of allowed characters in

query = *( pchar / "/" / "?" )

The characters slash ("/") and question mark ("?") may represent data within the query component.

I am unsure if rf3986 (Jan 2005) or the above mentioned rfc2396 (Aug 1998) is applicable

maxhbr avatar Jul 13 '22 04:07 maxhbr

the specification references rfc3986 and not rfc3296: https://github.com/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rst#a-purl-is-a-url

and that contains an example with an unescaped / in the query parameters:

      ftp://cnn.example.com&[email protected]/top_story.htm

EDIT: i misread that example :smile:

maxhbr avatar Jul 13 '22 05:07 maxhbr

it is sometimes better

...

which is pretty clear

I'd also argue that a vague wording that involves "sometimes" is far from clear...

sschuberth avatar Jul 13 '22 06:07 sschuberth

the specification references rfc3986 and not rfc3296: https://github.com/package-url/purl-spec/blob/master/PURL-SPECIFICATION.rst#a-purl-is-a-url

and that contains an example with an unescaped / in the query parameters:

      ftp://cnn.example.com&[email protected]/top_story.htm

IMO you fell into the "semantic attack" trap described in that very RFC document: The host in that example is not "cnn.example.com", but "10.0.0.1". And consequently, the "/" before "top_story.htm" is not part of the query string, but of the path.

sschuberth avatar Jul 13 '22 06:07 sschuberth

it is sometimes better

...

which is pretty clear

I'd also argue that a vague wording that involves "sometimes" is far from clear...

However, the more important quote from https://datatracker.ietf.org/doc/html/rfc3986#section-3.4 to me is

The characters slash ("/") and question mark ("?") may represent data within the query component.

This convinces me that it's not wrong use use these characters unescaped / literally. Would you agree @jsteinhofff?

sschuberth avatar Jul 13 '22 06:07 sschuberth

Yes @sschuberth , i think @maxhbr is right, first of all rfc3986 seems to be "the better RFC to refer to" (although i could not find a clear statement telling so, just concluding form what is used more broadly). Second the RFC definitely allows to use unescaped "/" and "?" in the query component.

Now that it is allowed the question is what PURL should make out of it, and there i think the critical statement is

If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

in https://datatracker.ietf.org/doc/html/rfc3986#section-2.2.

And conflicts in theory could happen, so the question is whether you want to leave it up to the user to check if conflicts are happening (hopefully they do so and dont oversee anything) or whether a consistent requirement "percent-encode repository_url value" is better.

Conflicts could happen if the repository URL contains:

  • authentication with @
  • a query by itself
  • a subpath by itself

I cant judge if those are realistic to come up, at least the uses-cases i can foresee here will not contain any of those, so technically i see the risk rather low. Still it might be worth a mention in the readme that repository_url should be percent-encoded if it contains any of the things listed above.

Regarding readability: i think this pro and con. Yes, the URL value gets much less readable but on the otherhand visually it then just becomes an obscure URL value and the PURL information is clearly distinguished form it.

jsteinhofff avatar Jul 13 '22 08:07 jsteinhofff

on the otherhand visually it then just becomes an obscure URL value and the PURL information is clearly distinguished form it.

Yes, I agree here.

Anyway, if we merge this revert, we should also change the docs that say "A value must be a percent-encoded string", probably swapping the "must" for a more lax recommendation along the lines of what @jsteinhofff suggested to avoid conflicts.

sschuberth avatar Jul 13 '22 10:07 sschuberth