url Allow protocol setter to switch between special and non-special schemes

trafficstars

Chrome, Firefox, and Safari all agree on this behavior:

u = new URL("custom-scheme://host?initially-no-path"); u.protocol = "https"; u.protocol = "custom-scheme"; u.href; // custom-scheme://host/?initially-no-path

Even though it is a little strange, I have reason to believe that people are using the URL protocol setter to do this already.

Nov 23 '21 16:11 achristensen07

Isn't this rather problematic as it changes the nature of the host? How do you deal with that, reparse? Query also currently branches on "is special".

Nov 23 '21 16:11 annevk

There is a whole host of problems here (pun definitely intended) but I'm just suggesting that this is the reality of URLs and the spec might want to reflect that. We could make the protocol setter re-parse the serialized URL or something like that.

Nov 23 '21 16:11 achristensen07

Yeah, I guess that is what we would have to do.

cc @TimothyGu

Nov 23 '21 17:11 annevk

@achristensen07 would you be able to describe the algorithm that Safari uses for the protocol setter, including the reparse step?

Nov 23 '21 18:11 TimothyGu

The current algorithm we use is as follows ( See URL::setProtocol at https://trac.webkit.org/browser/webkit/trunk/Source/WTF/wtf/URL.cpp#L390 ):

Take the value and remove any colon and anything after it
If the value is empty, does not start with an ASCII alpha character, or contains anything but ASCII alpha characters, '+', '-', or '.' (currently implemented in URLParser::maybeCanonicalizeScheme), return the unchanged URL.
If the "URL" is an invalid URL (an HTML anchor tag) then try parsing value then ':' then the invalid URL string (this step is strange and seems worth removing and not standardizing)
If value is "file" and the URL has credentials or a port then return the unchanged URL. (This is an implementation of https://github.com/whatwg/url/pull/269 in a different place)
If the URL's protocol is already "file" and has a host, return the unchanged URL. (This seems related to step 4, but I'm not sure what the history is.)
Serialize the current URL to a string, remove all characters before the first colon, then parse value with a colon and that string.

It seems like there are restrictions around "file", reparsing the whole URL, and removal of stuff after the colon that are common to all browsers.

Nov 24 '21 05:11 achristensen07

I think this is possible, but can be quite tricky. IMO this operation should avoid anything which would cause other URL components to be interpreted with a different meaning, and that may require a lot of testing/fuzzing/etc to figure out. But I think it's possible.

A few cases I can think of:

Empty hostnames should be checked before re-parsing. They're not allowed in special URLs (except file) anyway, so it's fine to bail, but sc:////////////notahost would be reparsed to http://notahost/ which seems dangerous.
Unescaped backslashes should be checked and maybe percent-encoded? sc://host/some/path\..\..\etc\passwd gets reparsed as http://host/etc/passwd.

Nov 29 '21 19:11 karwa

I like the idea of ensuring the parts don't get reinterpreted too much before allowing a non-special URL to switch to a special scheme. As an example, to allow switching to file, I think it's reasonable to require that:

The source URL must not have an opaque path.

To switch to http, we should additionally require:

The source URL must have a host. (Technically this implies the first requirement.)

As an additional example, Firefox and Safari currently have the following behavior:

u = new URL('javascript:!!notahost()');
u.protocol = 'http:';
console.assert(u.href, 'http://!!notahost()/');
// console.assert(u.href, 'http://%21%21notahost%28%29/'); // Chrome

which I don't think is useful or worth keeping.

It also sounds like escaping \ would be quite useful for this proposal. I filed #675 to track this.

Nov 30 '21 05:11 TimothyGu

Not that complicated, and no need to reparse the URL.

Jan 05 '22 20:01 alwinb

The alternative for switching to http for URLs with an empty or absent host, without serialising and reparsing the URL is to remove all empty path segments up to and including the first non-empty path segment, then parse that non-empty segment as an authority and adjust the username/password/host/port properties accordingly (or fail otherwise).

You’d have to adjust for the scheme dependent percent coding though.

That may sound complex, but there’s a lot of advantages to defining that as an operation on URL records. It is key in recovering a formal grammar and reference resolution as per RFC 3986 for WHATWG URLs.

Jan 25 '22 21:01 alwinb

I don’t have any way to take an object of properties and turn them into a URL, except Object.assign(new URL(foo), properties) - and there’s not a good blank URL to use that’s also special, so that properties can be any URL. Why arbitrarily prevent transitions if it’s mutable at all?

Aug 27 '23 23:08 ljharb

Hi folks,

We've finally shipped the special scheme to non-special scheme restriction in bug 1347459 in Firefox 117, but a regression has been reported: Bug 1850954 - Cant change URL.protocol since v117

Chrome has also landed this recently in https://bugs.chromium.org/p/chromium/issues/detail?id=1416018 , which isn't yet in release.

Is there a better way for sites to change the scheme other than this?

new URL("customprotocol://" + oldUrl.host + oldUrl.pathname + oldUrl.search + oldUrl.hash)

Do we commit to this restriction or is there any intent to allow special/non-special protocol changes?

Aug 31 '23 15:08 valenting

The relevant webkit bug https://bugs.webkit.org/show_bug.cgi?id=229427

Aug 31 '23 23:08 karlcow

@ricea thoughts?

Is there a better way for sites to change the scheme other than this?

This is how we would have to do it as well, given that host fundamentally changes. So setting protocol would essentially be setting href with some pre-work.

Sep 01 '23 06:09 annevk

@hayatoito is handling this in Chromium and has more context than me.

Sep 01 '23 08:09 ricea

I received a question on https://bugs.chromium.org/p/chromium/issues/detail?id=1416018 and posted a reply on https://bugs.chromium.org/p/chromium/issues/detail?id=1416018#c11.

I just assumed that the standard intentionally included this limitation.

Sep 04 '23 01:09 hayatoito

It's definitely intentional as changes to the scheme would end up changing other components as well, which is rather unexpected.

Having said that, https://url.spec.whatwg.org/#potentially-strip-trailing-spaces-from-an-opaque-path does so as well so there is precedent now, but that's a very limited impact. Changing the scheme however would end up changing the host, which is central to the authority of a URL. I'd rather not do that personally if we can get away with it.

Sep 04 '23 06:09 annevk

Thanks.

I currently don't intend to undo the Chrome fix, at least for now. The fix will be part of the M118 release to make Chrome align with the current URL standard, unless we encounter a significant issue.

If we decide to permit setting protocols, please inform me by updating the issue or here.

I thought this was a straightforward bug fix, and wasn't aware that there is on-going discussion to change the standard here. :(

Sep 04 '23 07:09 hayatoito

url url copied to clipboard

Allow protocol setter to switch between special and non-special schemes

url
url copied to clipboard