url icon indicating copy to clipboard operation
url copied to clipboard

Percent encoding does not round-trip

Open karwa opened this issue 5 years ago • 2 comments

Hi!

I'm trying to implement the current version of the spec in Swift (for non-browser applications. The idea is that it's useful to have a URL type that behaves as your browser behaves and accepts/rejects the same things).

One issue that I've noticed is that the current definition of the "UTF8 percent encode" algorithm doesn't round-trip for strings which contain the percent character. For example, following the algorithm, the string "%100" (UTF8: [37, 49, 48, 48]) is not changed at all when encoding (regardless or the percent-encoding set; none of them contain the "%" character itself). However, decoding that same string using the decoding algorithm in the spec results in the UTF8 sequence [16, 48], or "0" (ASCII 0x10 is the unprintable "data link escape" character).

RFC-3986 warns about this:

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI.

Indeed, the encodeURIComponent JS function encodes the percent character:

> encodeURIComponent("%100")
"%25100"

I was ready to submit a PR to have the spec's algorithm also do this, but it appears there is an explicit test that percent characters are not escaped (in this case, in the URL's username component): https://github.com/web-platform-tests/wpt/blob/master/url/resources/urltestdata.json#L2372

    "input": "http://%25DOMAIN:[email protected]/",
    "base": "about:blank",
    "href": "http://%25DOMAIN:[email protected]/",
    "origin": "http://foodomain.com",
    "protocol": "http:",
    "username": "%25DOMAIN",   // <--- I would expect this to be "%2525DOMAIN"

I'm not sure if this is correct. Usernames are required to be escaped (meaning they must be unescaped to recover their original value), as in the following example:

> new URL("http://;DOMAIN:[email protected]/")
...
username: "%3BDOMAIN" // <--- Semicolon is escaped as %3B

However, unescaping the string "%25DOMAIN" as the test expects would not recover the original and result in a different username - "%DOMAIN".

Can anybody confirm the correct behaviour? I think we should be escaping the % character, and that the test is incorrect. If that's not the case (and the current behaviour is correct), this could use a note or warning in the spec.

karwa avatar Jun 02 '20 12:06 karwa

See #513 and #518 for some upcoming changes in this area. They don't really change the status quo, but do add percent-encode sets that include U+0025, including one that is an exact match for the one in JavaScript.

And I guess we could add a note that it really depends on the caller which of the various percent-encode sets is safe to use.

annevk avatar Jun 02 '20 13:06 annevk

I can confirm that the spec and tests are correct, i.e. they match browsers. See https://jsdom.github.io/whatwg-url/#url=aHR0cDovLyUyNURPTUFJTjpmb29iYXJAZm9vZG9tYWluLmNvbS8=&base=YWJvdXQ6Ymxhbms=

domenic avatar Jun 02 '20 13:06 domenic

For clarity, if your username is %25DOMAIN you would have to percent-encode it before stuffing it in the URL: %2525DOMAIN. The URL parser will not percent-encode % for you. #727 attempts to clarify this. (My example doesn't use username, but the same principle applies.)

annevk avatar Dec 19 '22 15:12 annevk