Should we unescape characters in path?
Consider:

```html
<a href="https://jsdom.github.io/whatwg-url/">normal</a>,
<a href="https://jsdom.github.io/wh%61twg-url/">encoded 'a'</a>
```
In Chrome, both `a` elements have pathname `"/whatwg-url/"`, indicating that the `%61` was unescaped.

In Safari, the second `a` element has pathname `"/wh%61twg-url/"`, but when navigating, `/whatwg-url/` is actually the destination.

In Firefox, the second `a` element has pathname `"/wh%61twg-url/"`, and that pathname is used when navigating (resulting in a 404 error with GitHub Pages). However, confusingly, the address bar in the browser chrome shows the valid "whatwg-url", so if you go to the address bar and press Enter it works.
The spec currently doesn't do any unescaping. Should we?
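For reference, a spec-conformant parser (e.g. Node's WHATWG `URL` implementation, assuming it tracks the spec here) keeps the escape intact in the path:

```js
// Per the current spec, the parser copies the %61 escape through verbatim.
const plain = new URL("https://jsdom.github.io/whatwg-url/");
const encoded = new URL("https://jsdom.github.io/wh%61twg-url/");

console.log(plain.pathname);   // "/whatwg-url/"
console.log(encoded.pathname); // "/wh%61twg-url/"
```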
It sounds like a good idea to decode them, IMO. The latest HTTP semantics draft spec says:
> Scheme-based normalization (Section 6.2.3 of [RFC3986]) of "http" and "https" URIs involves the following additional rules: ... Characters other than those in the "reserved" set are equivalent to their percent-encoded octets: the normal form is to not encode them (see Sections 2.1 and 2.2 of [RFC3986]).
(I believe this would also apply to at least the query)
We already do the other HTTP-specific normalisations (removing default ports, root path instead of empty, lowercased host name), as well as other normalisations (e.g. exotic IP addresses), so I think it makes sense to do this, too. Some part of the system will have to - best to do it as soon as possible at the URL level to avoid mismatches like those you’ve described.
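As a rough, non-normative sketch of what that RFC 3986 rule amounts to: decode an escape only when its decoded byte falls in the "unreserved" set (`ALPHA / DIGIT / "-" / "." / "_" / "~"`), and leave every other escape alone:

```js
// Sketch (not spec text): decode %XX only when the decoded byte is in
// RFC 3986's "unreserved" set; reserved and other escapes are kept.
function decodeUnreserved(input) {
  return input.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const ch = String.fromCharCode(parseInt(hex, 16));
    return /[A-Za-z0-9\-._~]/.test(ch) ? ch : match;
  });
}

decodeUnreserved("/wh%61twg-url/"); // "/whatwg-url/"  (%61 = "a", unreserved)
decodeUnreserved("/a%2Fb");         // "/a%2Fb"        (%2F = "/", reserved, kept)
```

(This simplistic sketch is byte-by-byte and ASCII-only; a real algorithm would work on the percent-decoded bytes of each component.)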
Related issues: #369, #565 and #87. Quick questions, probably discussed there:

- Which set of characters (code points?) would then be unescaped (i.e. the unreserved set)?
- What to do with invalid sequences that become valid sequences after unescaping, e.g. `http://example/%1%61`?
I think #87 gives the reasons for not doing this.
It is still important to define the semantics of escape sequences, for server-side URL handling and for interoperability though. Currently the standard does not discuss that.
Filed a bug in Chromium to track this. https://crbug.com/1252531
FWIW, I'm not sure I agree with the notion that we need to define the semantics, or at least in such a way that there is only one valid interpretation. Servers can interpret URLs however they wish, and they do not necessarily need to agree with each other on that. If one server considers `a` and `A` the same and another does not, that's perfectly acceptable.
What about the semantics of percent encoded sequences though? Not the additional protocol- or application specific normalisations, but the analogue of Percent-Encoding Normalization in the RFC.
I don't think the URL Standard currently states that `%61` may be considered equivalent to `a`, other than in the domain (where it is obligatory). It might even make sense to mandate that, so that it is safe to assume that e.g. `http://example/%61` and `http://example/a` refer to the same resource, whether browsers normalise them to the same URL or not.
If we do make the semantics explicit, then the question arises of how to correct for invalid escape sequences, so that `%6%31` does not end up being (percent-encoding-)equivalent to `%61` and then transitively to `a`.
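To make that hazard concrete: a naive full percent-decode (roughly modelling the spec's percent-decode, which passes malformed sequences through literally) turns `%6%31` into the *valid* escape `%61`, so a second decode would reach `a`:

```js
// Rough model of percent-decoding: decode valid %XX pairs, pass
// malformed sequences (like "%6" followed by a non-hex char) through.
function percentDecode(input) {
  return input.replace(/%([0-9A-Fa-f]{2})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16)));
}

percentDecode("%6%31");                // "%61" — one decode creates a valid escape
percentDecode(percentDecode("%6%31")); // "a"   — a second decode reaches "a"
```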
We already define that `%2E` and `%2e` are equivalent to `.` in path components (even a mixed component like `.%2E` is treated equivalently to `..`).
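This is observable with the parser itself, since `.%2E` is an ASCII case-insensitive match for a double-dot path segment:

```js
// ".%2E" is treated as a double-dot segment, so it pops the preceding
// path segment exactly like a literal "..".
new URL("http://example.com/a/b/.%2E/c").pathname; // "/a/c"
new URL("http://example.com/a/b/../c").pathname;   // "/a/c"
```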
We also add percent-encoding for certain characters - both in the parser and via the component setters. It logically follows that we expect anybody who processes the resulting URL to treat them as equivalent, otherwise we would have produced a URL which points to a different resource than the user intended.
Basically, if we're not happy with saying that `%61` must always be the same as `a` (and that is the only interpretation), then the following operation:

```js
var url = new URL("http://example.com/foo/bar");
url.href; // "http://example.com/foo/bar"
url.pathname = "what should this do???";
url.href; // "http://example.com/what%20should%20this%20do%3F%3F%3F"
```

should also fail. Otherwise, the string `what%20should%20this%20do%3F%3F%3F` is not necessarily the same as `what should this do???`.
I don't think that's workable. At least for ASCII code-points, they must be equivalent.
> It logically follows that we expect anybody who processes the resulting URL to treat them as equivalent, otherwise we would have produced a URL which points to a different resource than the user intended.
Maybe? That really depends on whether the user knows what the parser will do.
I think it's okay to say that for path/query/fragment we generally expect https://url.spec.whatwg.org/#string-percent-decode to work, but I'm not sure why we'd mandate things we cannot really require. If a server wants to treat `%61` and `a` differently, it can.
I agree with Anne. Ideally we'd do as little processing as possible on a URL, and let the server handle them as well as they can.
There are corner cases beyond percent encoding. For example, `http://example.com/path/to//file` (two slashes) and `http://example.com/path/to/file` (one slash) are essentially equivalent from the filesystem's point of view, but depending on the web server you're using, they might not be. While the URL parser could say that we should collapse the two paths, it's probably more important that we keep the processing to a minimum in order to not change the URL's initial form.
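For what it's worth, the current parser does keep such empty path segments intact (a quick check, assuming a spec-tracking `URL` implementation such as Node's):

```js
// The parser preserves consecutive slashes (empty path segments)
// rather than collapsing them.
new URL("http://example.com/path/to//file").pathname; // "/path/to//file"
```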
> I think it's okay to say that for path/query/fragment we generally expect https://url.spec.whatwg.org/#string-percent-decode to work
A good way to go about that would be to point out in the standard that there are multiple normalisations/equivalence relations on URLs in common use, and perhaps add a section to explain that. I suppose, in the WHATWG style, it would contain algorithms that compute them (with a statement that they are not normative for browsers, where that is the case).

Percent-encoding normalisation/equivalence is a good one to start with. Collapsing slashes could be another one to mention; IIRC this is also relevant to another open issue about JS module specifiers.
I suppose we could add something to https://url.spec.whatwg.org/#url-equivalence that is very clearly scoped to non-browser contexts. There's also query parameter order and such (to the extent we want to acknowledge query parameters there, not sure).
Treating a feature like this as part of equivalence instead of canonical serialization would mean it definitely wouldn't apply to JS module system instance identification. For example, all JS module systems compare the `href` rather than using the equivalence algorithm, and for this reason fragments already result in separate module instances, and users have adopted this as a feature. Not sure which way I stand on that, but just noting that these decisions should always be considered with reference to their module implications at this point, please.

@guybedford yeah, the same is true for many other systems out there. And to be clear, it wouldn't change the default equivalence algorithm; it would just be an option if you want more URLs to be the same (that can also reasonably be argued to be the same).
The issue #565 which was just closed as duplicate has some additional discussion which might be worth checking out by those following this issue. But I agree on keeping any further discussion here.
Also reading #369 again, it has an especially good description. That issue also explains the address bar behaviour of Firefox.
My take is that we have to make a decision about how to handle invalid escape sequences. Not knowing/deciding how to handle those leads to the current situation by default rather than by deliberate decision. And if it ends up not being used by browsers, it is still valuable to specify a recommended behaviour for other applications.
To move on with https://github.com/whatwg/url/issues/606#issuecomment-930109895, I think it is needed. And I think we have to revisit the reserved characters. Unless we’d go for a comparison on URL records with fully decoded components. But I’m not sure that is desirable, especially in the query string.
By the way, I think adding something there is really great!
Any news on this? The regression around https://github.com/actix/actix-web/pull/2398 is sadly holding us back updating actix-web to the latest beta. https://github.com/svenstaro/miniserve/pull/677
I'm currently exploring implementing this in Swift, as adding/removing over-encoding is an important feature for interop with our existing RFC 2396 URL type, as well as a generally useful feature. Having looked at the previous issues, I'm reasonably convinced this is possible. I'm not seeing any insurmountable challenges.
> Maybe? That really depends on whether the user knows what the parser will do.
I don't really find this very satisfying; the same argument could be made the other way. If the user is expected to have a deep and detailed understanding of the parser, any behaviour is reasonable and nothing needs to be justified. It's a kind of circular reasoning where things happen because they happen.
> If a server wants to treat `%61` and `a` differently, it can.
On the one hand, this is demonstrably true because, well, form-encoding 😔. A `+` and a `%2B` may certainly be different depending on how the query is interpreted.
On the other hand, at least for some characters in some components, that behaviour would not appear to be web compatible. Routers, caches and CDNs will sometimes decode these bytes, and expect that they do not change the meaning of the URL. The discussions in previous issues seems to indicate that many browsers very much do expect these to be equivalent. This leads to the idea that we need some kind of "unreserved set" (perhaps per-component).
Such a server would serve different resources to different browsers for the same URL, which seems at odds with the idea of interoperability or the web as a platform. The evidence in this issue indicates that GitHub Pages is apparently performing as you say it may, and it breaks Firefox's ability to navigate to certain websites hosted on that server. If GHP is indeed entitled to behave that way, it suggests that all browsers which successfully navigate to that URL are wrong, which again does not seem to be a web-compatible position.
> There are corner cases beyond percent encoding. For example `http://example.com/path/to//file` (two slashes) and `http://example.com/path/to/file` (one slash) are essentially equivalent from the filesystem's point of view, but depending on the web server you're using, they might not be. While the URL parser could say that we should collapse the two paths, it's probably more important that we keep the processing to a minimal in order to not change the URL's initial form.
The difference, IMO, is that the URL parser does not add or remove empty path components (any more! It used to do that to file URLs). It does, however, add and remove percent-encoding, meaning there is already implicit acceptance that doing so does not change the meaning of the URL.
By definition, if the parser does something (e.g. turning `http://ex%61mple.com` into `http://example.com`), it must preserve meaning, as any attempt to utter the former as a URL record results in the latter, and URLs are records:

> A URL is a struct that represents a universal identifier. To disambiguate from a valid URL string it can also be referred to as a URL record.
We are forced to accept that the web's model of URLs, as defined by the various browser implementations over the decades, includes this assumption that percent-encoding may be safely added or removed in certain circumstances, and that a standard which attempts to describe that model must define that process and the circumstances where it applies.
> The evidence in this issue indicates that GitHub Pages is apparently performing as you say it may, and it breaks Firefox's ability to navigate to certain websites hosted on that server.
I don't think it does? Browsers that do not behave like Firefox (and Safari behaves like Firefox, so something changed since OP) would not be able to visit the 404 at https://jsdom.github.io/wh%61twg-url/ as they would instead get the resource at https://jsdom.github.io/whatwg-url/, which is different.
I'm going to treat this as a clarification issue as per https://github.com/whatwg/url/issues/606#issuecomment-930109895. PRs welcome.
@annevk By "treat this as a clarification issue", you mean continuing to preserve percent-encoding of characters outside the RFC 3986 reserved set in https://url.spec.whatwg.org/#path-state? That is permitted by the current HTTP semantics document (which does not require normalization but does make clear that such gratuitous encoding maintains the interpretation of a URI, i.e. it still identifies the same resource), although it does put more burden on user code that is trying to be robust. How would you feel about an issue requesting a `normalize` method?
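Purely as an illustration (this `normalizePath` helper is hypothetical, not a proposed API), such a method might decode only those path escapes whose decoded byte is in RFC 3986's unreserved set, leaving reserved and other escapes untouched:

```js
// Hypothetical helper, NOT a standard API: returns a new URL whose path
// has unreserved percent-escapes (ALPHA / DIGIT / "-" / "." / "_" / "~")
// decoded; reserved escapes like %2F are kept as-is.
function normalizePath(url) {
  const out = new URL(url);
  out.pathname = out.pathname.replace(/%([0-9A-Fa-f]{2})/g, (m, hex) => {
    const ch = String.fromCharCode(parseInt(hex, 16));
    return /[A-Za-z0-9\-._~]/.test(ch) ? ch : m;
  });
  return out;
}

normalizePath(new URL("https://jsdom.github.io/wh%61twg-url/")).href;
// "https://jsdom.github.io/whatwg-url/"
```

A real design would also have to decide how to treat the query, invalid escape sequences, and non-ASCII bytes.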
Yeah. A new API seems fair (though see also https://whatwg.org/faq#adding-new-features and https://whatwg.org/working-mode#changes). Deciding on the semantics might be tricky, but hopefully we can figure something out. (Would be nice if that also tackled query params and such.)