Allow protocol setter to switch between special and non-special schemes

Open achristensen07 opened this issue 3 years ago • 17 comments
Chrome, Firefox, and Safari all agree on this behavior:

u = new URL("custom-scheme://host?initially-no-path");
u.protocol = "https";
u.protocol = "custom-scheme";
u.href; // custom-scheme://host/?initially-no-path

Even though it is a little strange, I have reason to believe that people are using the URL protocol setter to do this already.
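
Whether a given engine still performs this switch can be probed directly. The sketch below is only a feature check: the function name is made up for illustration, and the result depends on the engine and version it runs in.

```javascript
// Hypothetical feature check: does the protocol setter move a URL from a
// non-special scheme to a special one in the engine running this code?
function protocolSetterSwitchesSchemeKind() {
  const u = new URL("custom-scheme://host?initially-no-path");
  u.protocol = "https"; // engines that restrict the setter leave u unchanged
  return u.protocol === "https:";
}

console.log(protocolSetterSwitchesSchemeKind());
```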

achristensen07 avatar Nov 23 '21 16:11 achristensen07

Isn't this rather problematic as it changes the nature of the host? How do you deal with that, reparse? Query also currently branches on "is special".

annevk avatar Nov 23 '21 16:11 annevk

There is a whole host of problems here (pun definitely intended) but I'm just suggesting that this is the reality of URLs and the spec might want to reflect that. We could make the protocol setter re-parse the serialized URL or something like that.

achristensen07 avatar Nov 23 '21 16:11 achristensen07

Yeah, I guess that is what we would have to do.

cc @TimothyGu

annevk avatar Nov 23 '21 17:11 annevk

@achristensen07 would you be able to describe the algorithm that Safari uses for the protocol setter, including the reparse step?

TimothyGu avatar Nov 23 '21 18:11 TimothyGu

The current algorithm we use is as follows (see URL::setProtocol at https://trac.webkit.org/browser/webkit/trunk/Source/WTF/wtf/URL.cpp#L390):

  1. Take the value and remove any colon and anything after it
  2. If the value is empty, does not start with an ASCII alpha character, or contains anything but ASCII alpha characters, '+', '-', or '.' (currently implemented in URLParser::maybeCanonicalizeScheme), return the unchanged URL.
  3. If the "URL" is an invalid URL (an HTML anchor tag), try parsing the concatenation of value, ':', and the invalid URL string. (This step is strange and seems worth removing rather than standardizing.)
  4. If value is "file" and the URL has credentials or a port then return the unchanged URL. (This is an implementation of https://github.com/whatwg/url/pull/269 in a different place)
  5. If the URL's protocol is already "file" and has a host, return the unchanged URL. (This seems related to step 4, but I'm not sure what the history is.)
  6. Serialize the current URL to a string, remove all characters before the first colon, then parse value with a colon and that string.

It seems like there are restrictions around "file", reparsing the whole URL, and removal of stuff after the colon that are common to all browsers.
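
As a rough illustration, the numbered steps above can be sketched in JavaScript. This is not WebKit's code: `setProtocolByReparse` is a hypothetical name, step 3 is skipped because a URL object is never "invalid", the lowercasing is an assumption about canonicalization, and treating a failed reparse as "unchanged" is also an assumption.

```javascript
// Hypothetical helper that follows the numbered steps as described in this
// comment. Returns the resulting serialization instead of mutating the URL.
function setProtocolByReparse(url, value) {
  // Step 1: drop the first colon and everything after it.
  const colon = value.indexOf(":");
  // Lowercasing is an assumption standing in for scheme canonicalization.
  const scheme = (colon === -1 ? value : value.slice(0, colon)).toLowerCase();

  // Step 2: bail on an empty or malformed scheme (character set as listed above).
  if (!/^[a-z][a-z+\-.]*$/.test(scheme)) return url.href;

  // Step 3 (invalid anchor URLs) has no equivalent for a URL object; skipped.

  // Step 4: "file" cannot take over credentials or a port.
  if (scheme === "file" && (url.username || url.password || url.port)) {
    return url.href;
  }

  // Step 5: a file URL that has a host is left alone.
  if (url.protocol === "file:" && url.host !== "") return url.href;

  // Step 6: keep everything from the first colon onward and reparse.
  const rest = url.href.slice(url.href.indexOf(":"));
  try {
    return new URL(scheme + rest).href;
  } catch {
    return url.href; // assumption: a failed reparse leaves the URL unchanged
  }
}

const switched = setProtocolByReparse(
  new URL("custom-scheme://host?initially-no-path"),
  "https"
);
console.log(switched); // https://host/?initially-no-path
```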

achristensen07 avatar Nov 24 '21 05:11 achristensen07

I think this is possible, but can be quite tricky. IMO this operation should avoid anything which would cause other URL components to be interpreted with a different meaning, and that may require a lot of testing/fuzzing/etc to figure out. But I think it's possible.

A few cases I can think of:

  • Empty hostnames should be checked before re-parsing. They're not allowed in special URLs (except file) anyway, so it's fine to bail, but sc:////////////notahost would be reparsed to http://notahost/ which seems dangerous.

  • Unescaped backslashes should be checked and maybe percent-encoded? sc://host/some/path\..\..\etc\passwd gets reparsed as http://host/etc/passwd.
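
Both cases can be reproduced with the URL constructor alone, which avoids depending on how any particular protocol setter behaves. `reparseWithScheme` is a hypothetical helper that swaps the scheme on the serialized string and parses the result again, roughly what a reparsing setter would do.

```javascript
// Hypothetical helper: replace the scheme on the serialized URL and reparse.
function reparseWithScheme(input, scheme) {
  const href = new URL(input).href;
  return new URL(scheme + href.slice(href.indexOf(":"))).href;
}

// Before the switch the URL has an empty host; afterwards a path segment
// has become the host.
console.log(new URL("sc:////////////notahost").hostname === ""); // true
console.log(reparseWithScheme("sc:////////////notahost", "http"));
// http://notahost/

// Backslashes are ordinary path characters in a non-special URL but act as
// separators in a special one, so the dot segments start to take effect.
console.log(reparseWithScheme("sc://host/some/path\\..\\..\\etc\\passwd", "http"));
// http://host/etc/passwd
```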

karwa avatar Nov 29 '21 19:11 karwa

I like the idea of ensuring the parts don't get reinterpreted too much before allowing a non-special URL to switch to a special scheme. As an example, to allow switching to file, I think it's reasonable to require that:

  • The source URL must not have an opaque path.

To switch to http, we should additionally require:

  • The source URL must have a host. (Technically this implies the first requirement.)
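
A minimal sketch of these two preconditions follows. Both function names are hypothetical. The URL API does not expose "has an opaque path" directly, so the check below is a heuristic based on the serialization, and "has a host" is taken to mean a non-empty hostname, which is an assumption.

```javascript
// Heuristic: a URL with an opaque path has a pathname that does not start
// with "/" and a serialization without "//" right after the scheme.
function hasOpaquePath(url) {
  return !url.pathname.startsWith("/") &&
         !url.href.startsWith(url.protocol + "//");
}

// Hypothetical guard for switching a non-special URL to a special scheme.
// Only "file" is exempt from the host requirement in this sketch.
function canSwitchToSpecial(url, targetScheme) {
  if (hasOpaquePath(url)) return false;
  if (targetScheme !== "file" && url.hostname === "") return false;
  return true;
}

console.log(canSwitchToSpecial(new URL("javascript:!!notahost()"), "http")); // false
console.log(canSwitchToSpecial(new URL("sc://host/p"), "http"));             // true
console.log(canSwitchToSpecial(new URL("sc:/p"), "file"));                   // true
console.log(canSwitchToSpecial(new URL("sc:/p"), "http"));                   // false
```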

As an additional example, Firefox and Safari currently have the following behavior:

u = new URL('javascript:!!notahost()');
u.protocol = 'http:';
console.assert(u.href === 'http://!!notahost()/');
// console.assert(u.href === 'http://%21%21notahost%28%29/'); // Chrome

which I don't think is useful or worth keeping.


It also sounds like escaping \ would be quite useful for this proposal. I filed #675 to track this.

TimothyGu avatar Nov 30 '21 05:11 TimothyGu

Not that complicated, and no need to reparse the URL.

alwinb avatar Jan 05 '22 20:01 alwinb

The alternative for switching to http for URLs with an empty or absent host, without serialising and reparsing the URL, is to remove all empty path segments up to and including the first non-empty path segment, then parse that non-empty segment as an authority and adjust the username/password/host/port properties accordingly (or fail otherwise).

You’d have to adjust for the scheme dependent percent coding though.
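
A rough sketch of that operation on a simplified record might look like the following. The record shape (`scheme`, `username`, `password`, `host`, `port`, `path` as an array of segments) and the function name are assumptions made for illustration. Authority parsing is delegated to the URL constructor as a stand-in, and the scheme-dependent percent-coding adjustment is left out.

```javascript
// Hypothetical record-level operation: drop leading empty path segments,
// then reinterpret the first non-empty segment as an authority.
// Returns a new record, or null on failure.
function promoteFirstSegmentToAuthority(record, scheme) {
  const index = record.path.findIndex((segment) => segment !== "");
  if (index === -1) return null; // no non-empty segment to promote

  let parsed;
  try {
    // Stand-in for a real authority parser.
    parsed = new URL(`${scheme}://${record.path[index]}`);
  } catch {
    return null;
  }
  // The segment must parse as an authority and nothing else.
  if (parsed.pathname !== "/" || parsed.search || parsed.hash) return null;

  return {
    ...record,
    scheme,
    username: parsed.username,
    password: parsed.password,
    host: parsed.hostname,
    port: parsed.port,
    path: record.path.slice(index + 1),
  };
}

const record = {
  scheme: "sc", username: "", password: "", host: null, port: "",
  path: ["", "", "user@example.com:8080", "a", "b"],
};
console.log(promoteFirstSegmentToAuthority(record, "http"));
```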

That may sound complex, but there are a lot of advantages to defining that as an operation on URL records. It is key in recovering a formal grammar and reference resolution as per RFC 3986 for WHATWG URLs.

alwinb avatar Jan 25 '22 21:01 alwinb

I don’t have any way to take an object of properties and turn them into a URL, except Object.assign(new URL(foo), properties) - and there’s not a good blank URL to use that’s also special, so that properties can be any URL. Why arbitrarily prevent transitions if it’s mutable at all?

ljharb avatar Aug 27 '23 23:08 ljharb

Hi folks,

We've finally shipped the special scheme to non-special scheme restriction in bug 1347459 in Firefox 117, but a regression has been reported: Bug 1850954 - Cant change URL.protocol since v117

Chrome has also landed this recently in https://bugs.chromium.org/p/chromium/issues/detail?id=1416018 , which isn't yet in release.

Is there a better way for sites to change the scheme other than this?

new URL("customprotocol://" + oldUrl.host + oldUrl.pathname + oldUrl.search + oldUrl.hash)
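
Wrapped as a function, the workaround above might look like this sketch. `withScheme` is a hypothetical name. Because only host, pathname, search and hash are concatenated, any username and password on the original URL are not carried over.

```javascript
// Hypothetical wrapper around the concatenation shown above.
// Builds a new URL; the original object is left untouched.
function withScheme(oldUrl, scheme) {
  return new URL(
    scheme + "://" + oldUrl.host + oldUrl.pathname + oldUrl.search + oldUrl.hash
  );
}

const oldUrl = new URL("https://user:pw@example.com:8080/a?b#c");
console.log(withScheme(oldUrl, "customprotocol").href);
// customprotocol://example.com:8080/a?b#c  (credentials dropped)
```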

Do we commit to this restriction or is there any intent to allow special/non-special protocol changes?

valenting avatar Aug 31 '23 15:08 valenting

The relevant webkit bug https://bugs.webkit.org/show_bug.cgi?id=229427

karlcow avatar Aug 31 '23 23:08 karlcow

@ricea thoughts?

> Is there a better way for sites to change the scheme other than this?

This is how we would have to do it as well, given that host fundamentally changes. So setting protocol would essentially be setting href with some pre-work.

annevk avatar Sep 01 '23 06:09 annevk

@hayatoito is handling this in Chromium and has more context than me.

ricea avatar Sep 01 '23 08:09 ricea

I received a question on https://bugs.chromium.org/p/chromium/issues/detail?id=1416018 and posted a reply on https://bugs.chromium.org/p/chromium/issues/detail?id=1416018#c11.

I just assumed that the standard intentionally included this limitation.

hayatoito avatar Sep 04 '23 01:09 hayatoito

It's definitely intentional as changes to the scheme would end up changing other components as well, which is rather unexpected.

Having said that, https://url.spec.whatwg.org/#potentially-strip-trailing-spaces-from-an-opaque-path does so as well so there is precedent now, but that's a very limited impact. Changing the scheme however would end up changing the host, which is central to the authority of a URL. I'd rather not do that personally if we can get away with it.

annevk avatar Sep 04 '23 06:09 annevk

Thanks.

I currently don't intend to undo the Chrome fix, at least for now. The fix will be part of the M118 release to make Chrome align with the current URL standard, unless we encounter a significant issue.

If we decide to permit setting protocols, please inform me by updating the issue or here.

I thought this was a straightforward bug fix, and wasn't aware that there is an ongoing discussion to change the standard here. :(

hayatoito avatar Sep 04 '23 07:09 hayatoito