url
url copied to clipboard
Allow protocol setter to switch between special and non-special schemes
Chrome, Firefox, and Safari all agree on this behavior:
u = new URL("custom-scheme://host?initially-no-path");
u.protocol = "https";
u.protocol = "custom-scheme";
u.href; // custom-scheme://host/?initially-no-path
Even though it is a little strange, I have reason to believe that people are using the URL protocol setter to do this already.
Isn't this rather problematic as it changes the nature of the host? How do you deal with that, reparse? Query also currently branches on "is special".
There is a whole host of problems here (pun definitely intended) but I'm just suggesting that this is the reality of URLs and the spec might want to reflect that. We could make the protocol setter re-parse the serialized URL or something like that.
Yeah, I guess that is what we would have to do.
cc @TimothyGu
@achristensen07 would you be able to describe the algorithm that Safari uses for the protocol setter, including the reparse step?
The current algorithm we use is as follows ( See URL::setProtocol at https://trac.webkit.org/browser/webkit/trunk/Source/WTF/wtf/URL.cpp#L390 ):
- Take the value and remove any colon and anything after it
- If the value is empty, does not start with an ASCII alpha character, or contains anything but ASCII alpha characters, '+', '-', or '.' (currently implemented in URLParser::maybeCanonicalizeScheme), return the unchanged URL.
- If the "URL" is an invalid URL (an HTML anchor tag) then try parsing value then ':' then the invalid URL string (this step is strange and seems worth removing and not standardizing)
- If value is "file" and the URL has credentials or a port then return the unchanged URL. (This is an implementation of https://github.com/whatwg/url/pull/269 in a different place)
- If the URL's protocol is already "file" and has a host, return the unchanged URL. (This seems related to step 4, but I'm not sure what the history is.)
- Serialize the current URL to a string, remove all characters before the first colon, then parse value with a colon and that string.
It seems like there are restrictions around "file", reparsing the whole URL, and removal of stuff after the colon that are common to all browsers.
I think this is possible, but can be quite tricky. IMO this operation should avoid anything which would cause other URL components to be interpreted with a different meaning, and that may require a lot of testing/fuzzing/etc to figure out. But I think it's possible.
A few cases I can think of:
-
Empty hostnames should be checked before re-parsing. They're not allowed in special URLs (except file) anyway, so it's fine to bail, but
sc:////////////notahostwould be reparsed tohttp://notahost/which seems dangerous. -
Unescaped backslashes should be checked and maybe percent-encoded?
sc://host/some/path\..\..\etc\passwdgets reparsed ashttp://host/etc/passwd.
I like the idea of ensuring the parts don't get reinterpreted too much before allowing a non-special URL to switch to a special scheme. As an example, to allow switching to file, I think it's reasonable to require that:
- The source URL must not have an opaque path.
To switch to http, we should additionally require:
- The source URL must have a host. (Technically this implies the first requirement.)
As an additional example, Firefox and Safari currently have the following behavior:
u = new URL('javascript:!!notahost()');
u.protocol = 'http:';
console.assert(u.href, 'http://!!notahost()/');
// console.assert(u.href, 'http://%21%21notahost%28%29/'); // Chrome
which I don't think is useful or worth keeping.
It also sounds like escaping \ would be quite useful for this proposal. I filed #675 to track this.
Not that complicated, and no need to reparse the URL.
The alternative for switching to http for URLs with an empty or absent host, without serialising and reparsing the URL is to remove all empty path segments up to and including the first non-empty path segment, then parse that non-empty segment as an authority and adjust the username/password/host/port properties accordingly (or fail otherwise).
You’d have to adjust for the scheme dependent percent coding though.
That may sound complex, but there’s a lot of advantages to defining that as an operation on URL records. It is key in recovering a formal grammar and reference resolution as per RFC 3986 for WHATWG URLs.
I don’t have any way to take an object of properties and turn them into a URL, except Object.assign(new URL(foo), properties) - and there’s not a good blank URL to use that’s also special, so that properties can be any URL. Why arbitrarily prevent transitions if it’s mutable at all?
Hi folks,
We've finally shipped the special scheme to non-special scheme restriction in bug 1347459 in Firefox 117, but a regression has been reported: Bug 1850954 - Cant change URL.protocol since v117
Chrome has also landed this recently in https://bugs.chromium.org/p/chromium/issues/detail?id=1416018 , which isn't yet in release.
Is there a better way for sites to change the scheme other than this?
new URL("customprotocol://" + oldUrl.host + oldUrl.pathname + oldUrl.search + oldUrl.hash)
Do we commit to this restriction or is there any intent to allow special/non-special protocol changes?
The relevant webkit bug https://bugs.webkit.org/show_bug.cgi?id=229427
@ricea thoughts?
Is there a better way for sites to change the scheme other than this?
This is how we would have to do it as well, given that host fundamentally changes. So setting protocol would essentially be setting href with some pre-work.
@hayatoito is handling this in Chromium and has more context than me.
I received a question on https://bugs.chromium.org/p/chromium/issues/detail?id=1416018 and posted a reply on https://bugs.chromium.org/p/chromium/issues/detail?id=1416018#c11.
I just assumed that the standard intentionally included this limitation.
It's definitely intentional as changes to the scheme would end up changing other components as well, which is rather unexpected.
Having said that, https://url.spec.whatwg.org/#potentially-strip-trailing-spaces-from-an-opaque-path does so as well so there is precedent now, but that's a very limited impact. Changing the scheme however would end up changing the host, which is central to the authority of a URL. I'd rather not do that personally if we can get away with it.
Thanks.
I currently don't intend to undo the Chrome fix, at least for now. The fix will be part of the M118 release to make Chrome align with the current URL standard, unless we encounter a significant issue.
If we decide to permit setting protocols, please inform me by updating the issue or here.
I thought this was a straightforward bug fix, and wasn't aware that there is on-going discussion to change the standard here. :(