javascript: URL parsing
Noticed via https://github.com/tmpvar/jsdom/issues/1836 by @tsirolnik
Given
const s = 'javascript:window.location.replace("https://whatsoever.com/?a=b&c=5&x=y")';
then there are two separate problems:
- new URL(s).pathname will give window.location.replace("https://whatsoever.com/; the rest will end up in query etc. This seems unexpected, but not fatal, at least.
- new URL(s).href will give javascript:window.location.replace("https://whatsoever.com/?a=b&c=5&x=y%22), i.e. it percent encodes the rest of the string, including the closing quote.
The second of these is especially bad, because per HTML's navigate algorithm, how javascript: URLs are executed is by serializing them to a string then stripping the leading javascript:. So this will give a syntax error, which is not how browsers behave.
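For anyone who wants to reproduce the two observations, here is a minimal sketch. It assumes a runtime with a spec-conforming global URL class (for example Node.js); the variable name parsed is only illustrative.

```javascript
const s = 'javascript:window.location.replace("https://whatsoever.com/?a=b&c=5&x=y")';
const parsed = new URL(s);

// Everything up to the first "?" ends up in the path.
console.log(parsed.pathname);

// The remainder is treated as the query, where the closing quote is percent-encoded.
console.log(parsed.search);

// As a result the serialization is no longer identical to the input string.
console.log(parsed.href === s);
```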
I think javascript: URLs need to be special-cased in the URL parser, unfortunately.
I suspect a very similar bug report could apply to data: URLs. Although I just rediscovered today that data: URLs are very underspecified, per https://simonsapin.github.io/data-urls/, so maybe that's a separate can of worms...
HTML's navigate algorithm also applies percent decode though. Which would turn that into the original URL. So you need a more contrived example / test to demonstrate the issue you're after.
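To make that concrete, the following sketch walks the original example through serialize, strip the leading javascript:, then percent-decode. The percentDecode helper is a hand-written approximation of the byte-wise percent-decode plus UTF-8 decode steps, not code taken from either spec, and it assumes global URL, TextEncoder and TextDecoder.

```javascript
// Simplified stand-in for "percent-decode, then UTF-8 decode".
function percentDecode(input) {
  const bytes = new TextEncoder().encode(input);
  const isHex = (b) =>
    (b >= 0x30 && b <= 0x39) || (b >= 0x41 && b <= 0x46) || (b >= 0x61 && b <= 0x66);
  const out = [];
  for (let i = 0; i < bytes.length; i++) {
    // Only "%" followed by two hex digits is decoded; anything else is copied through.
    if (bytes[i] === 0x25 && i + 2 < bytes.length && isHex(bytes[i + 1]) && isHex(bytes[i + 2])) {
      out.push(parseInt(String.fromCharCode(bytes[i + 1], bytes[i + 2]), 16));
      i += 2;
    } else {
      out.push(bytes[i]);
    }
  }
  return new TextDecoder().decode(Uint8Array.from(out));
}

const original = 'javascript:window.location.replace("https://whatsoever.com/?a=b&c=5&x=y")';
const serialized = new URL(original).href;
const scriptSource = percentDecode(serialized.slice('javascript:'.length));

// For this input the %22 introduced by the parser is undone again.
console.log(scriptSource === original.slice('javascript:'.length));
```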
Wow, somehow I completely skipped over that step; many apologies. I'll see what I can do... I'm hoping that this spec was written with some browser reverse-engineering in hand so it should match somebody at least.
I'll also try to test what people do for <a>.pathname for such cases.
So this is interesting. Live URL viewer indicates:
- Firefox stores the data somewhere else entirely which is not accessible except through href
- Chrome almost matches the spec, except it does not percent-encode the closing quote in the search
- Safari Tech Preview matches the spec entirely (yay!)
- Edge stores everything in the pathname
Can you think of a more convoluted example where parsing, serializing, percent-decoding, and stripping off leading javascript: does not round trip? My intuition is that there would be one; those steps don't really seem like the reverse of each other. But I can't find one by tinkering.
At this point I'm leaning toward just adding tests and not changing anything in either spec. I think it might be nicer for developers if everything was in the pathname, but I understand the argument about not wanting to pollute the URL parser with scheme-specific parsers.
Maybe a compromise would be to define this "recover everything after the scheme as a string" operation so that other specs like HTML could refer to it? You could also imagine a theoretical spec for a mail client which receives mailto: URLs wanting to use such a mechanism.
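A rough sketch of what such an operation could look like on top of the existing API. The name afterScheme is made up for illustration and is not something either spec defines.

```javascript
// Hypothetical helper: return the serialized URL minus its scheme and the ":".
function afterScheme(url) {
  // url.protocol is the scheme plus the trailing ":", and href always starts with it.
  return url.href.slice(url.protocol.length);
}

const jsPart = afterScheme(new URL('javascript:alert(1)'));
const mailPart = afterScheme(new URL('mailto:someone@example.com?subject=hi'));

console.log(jsPart);
console.log(mailPart);
```

Note that this works on the serialization, so any percent-encoding applied by the parser is still present in the result.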
I can't really think of something. I was thinking something with % maybe, but we don't do anything special with it.
I'm happy to include the abstract operation you propose.
https://jsdom.github.io/whatwg-url/#url=amF2YXNjcmlwdDovL3Rlc3Q6dGVzdC8lMGFhbGVydCgxKQ==&base=YWJvdXQ6Ymxhbms=
I think this URL should fail to parse per the standard, correct? That does not seem to match browsers.
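For reference, the url= value in that Live URL Viewer link is base64 for javascript://test:test/%0aalert(1). A small sketch of the failure, assuming global atob and URL: as I read the parser, the // makes it look for an authority, and test is then not a valid port.

```javascript
// The url= fragment parameter from the Live URL Viewer link above.
const encoded = 'amF2YXNjcmlwdDovL3Rlc3Q6dGVzdC8lMGFhbGVydCgxKQ==';
const input = atob(encoded);
console.log(input);

let parseFailed = false;
try {
  new URL(input);
} catch (err) {
  // The URL constructor signals a parse failure with a TypeError.
  parseFailed = err instanceof TypeError;
}
console.log(parseFailed);
```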
It matches Safari TP, but maybe that's a problematic case, indeed.
In Safari TP http://software.hixie.ch/utilities/js/live-dom-viewer/saved/5134 shows an alert when clicking the link. That is not explained by following the spec's algorithm since it needs a URL record to serialize AIUI.
https://wpt.live/url/url-setters.any.html has tests that expect to be able to set a username and password on a javascript URL. This makes me uncomfortable. Do you want me to file a separate issue for it?
@achristensen07 do you have thoughts on this? It seems WebKit might have inconsistent handling of javascript: URLs. E.g., on https://jsdom.github.io/whatwg-url/ javascript://test:test does not throw, but in new URL() it does. Contrast that with test://test:test.
@ricea I could see us making a distinct decision there (e.g., modifying "cannot have a username/password/port" to include the "javascript" scheme), so that's probably fair. But I think we need something stronger than "uncomfortable", especially if the parser does still allow for them to be included.
Instead of "uncomfortable" how about "semantically meaningless"?
IIUC, javascript:// is a script containing a single comment. Setting username on it then modifies that comment to have an extra string at the beginning? It appears to be a harmless but nonsensical operation. But if there turned out to be a scenario in which it actually had some functional impact I would regret it. To me it looks like a security hole waiting for somebody to come along and find a way to exploit it.
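A small sketch of the operation in question, assuming a spec-conforming global URL class; the host and username values are placeholders.

```javascript
const u = new URL('javascript://host/');
u.username = 'user';

// The setter succeeds because the URL has a non-empty host.
console.log(u.href);

// What would be handed to the script engine after stripping the scheme:
// still a single line comment, now with extra text at the start.
const script = u.href.slice('javascript:'.length);
console.log(script.startsWith('//'));
```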
I think that argument is worse as there are many operations the API allows for that we don't necessarily know the semantics of. There are essentially endless schemes and we only have knowledge of a couple of them. And that we have special knowledge of them is mainly an accident of history.
(I could see being more strict in Location and HTMLHyperlinkElementUtils though.)
I tried to go through this again:
- @domenic's original example is now interoperable between WebKit and Chromium, but requires a new Live URL Viewer URL: https://jsdom.github.io/whatwg-url/#url=amF2YXNjcmlwdDp3aW5kb3cubG9jYXRpb24ucmVwbGFjZSgiaHR0cHM6Ly93aGF0c29ldmVyLmNvbS8/YT1iJmM9NSZ4PXkiKQ==&base=YWJvdXQ6Ymxhbms=
- @zcorpan's examples still show a difference: https://jsdom.github.io/whatwg-url/#url=amF2YXNjcmlwdDovL3Rlc3Q6dGVzdC8lMGFhbGVydCgxKQ==&base=YWJvdXQ6Ymxhbms= errors in WebKit, but not Chromium. And http://software.hixie.ch/utilities/js/live-dom-viewer/saved/5134 alerts in both WebKit and Chromium.
If you inject a tab character into the scheme of the latter example you get some kind of error in WebKit, but Chromium still shows the alert.
I think ideally we fix these cases by failing to parse, as WebKit already does in new URL(). Thoughts?
I'm happy to add the relevant tests.
Yeah, Chromium had a lot of javascript URL parsing changes a couple of years ago. I believe that is now fully aligned with the HTML spec, but of course the URL parser itself is not aligned yet.
I think failing to parse in the URL parser is the best idea. Unfortunately, the non-special URL parsing in Chrome still doesn't actually parse apart authority and path.
I'm not sure why WebKit behaves differently between new URL() and <a href> though – I presume the latter uses a different parser for javascript: URLs?
I made an example where giving the special non-special scheme treatment to a javascript: URL results in different behaviour:
javascript://host/1%0a//../0/;alert('non-opaque path');/%0aalert('opaque path');/..///
https://jsdom.github.io/whatwg-url/#url=amF2YXNjcmlwdDovL2hvc3QvMSUwYS8vLi4vMC87YWxlcnQoJ25vbi1vcGFxdWUgcGF0aCcpOy8lMGFhbGVydCgnb3BhcXVlIHBhdGgnKTsvLi4vLy8=&base=YWJvdXQ6Ymxhbms=
This is obviously a very contrived example, but it illustrates that applying non-special treatment to javascript: URLs is surprising.
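The difference can be sketched outside a browser as follows. This assumes a spec-conforming global URL class; decodeURIComponent is used as a convenient stand-in for percent-decoding (it would throw on malformed sequences, which this input does not contain), and liveLines is a crude illustrative heuristic, not a JavaScript parser.

```javascript
const input = "javascript://host/1%0a//../0/;alert('non-opaque path');/%0aalert('opaque path');/..///";

// Current parser behaviour: "//" starts an authority, so the rest is a
// hierarchical path in which ".." segments are resolved before serializing.
const parsedScript = decodeURIComponent(new URL(input).href.slice('javascript:'.length));

// Hypothetical alternative: treat everything after the scheme as opaque text.
const opaqueScript = decodeURIComponent(input.slice('javascript:'.length));

// Crude heuristic: lines that do not start with "//" are not pure comments.
const liveLines = (script) => script.split('\n').filter((line) => !line.startsWith('//'));

console.log(liveLines(parsedScript));
console.log(liveLines(opaqueScript));
```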
Interesting, for that case I get non-opaque path in WebKit in <a> as well. I'm still happy with HTML and URL as they are here, but we should add these tests.
It would probably take a usecounter to be sure, but I strongly suspect most (functional) javascript: URLs in the wild do not start with javascript:/ or javascript://. The first case is possible for regexes and the second is possible if the underlying JavaScript code is multi-line, but they both seem quite unusual/contrived. If this is indeed the case, then I suppose most of our conversation here is more academic than actually material.