
Comments on URL-interop.md

Open SimonSapin opened this issue 8 years ago • 5 comments

I’m reading it at commit 863655160ffe6696ece399e4e8ac0e0bf08f7941.

86: must have the scheme present

TWUS: Describes in the 4.2 URL parsing section how a parser should accept URLs without a scheme.

IIRC the TWUS parser only accepts input without a scheme when there’s a base URL. The input is relative, in these cases.

86 has this grammar, which seems equivalent?

URI-reference = URI / relative-ref
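The equivalence is easy to see in code. A scheme-less input is a relative reference and only becomes a full URL once resolved against a base; this sketch uses Python's RFC 3986-based `urllib.parse` as a stand-in (not a TWUS implementation):

```python
from urllib.parse import urljoin

# A scheme-less input is only meaningful relative to a base URL;
# resolving it against the base yields an absolute URL.
print(urljoin("http://example.com/a/b", "c/d"))
# http://example.com/a/c/d

# A network-path reference ("//host/path") also has no scheme
# and inherits it from the base.
print(urljoin("http://example.com/a/b", "//other.example/x"))
# http://other.example/x
```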

It also there divides parsers into "Non-web-browser implementations" without specifying how to make that distinction.

In this specific instance, I think "Non-web-browser" means anything that doesn’t also implement https://w3c.github.io/FileAPI/ since the difference between "basic URL parser" and "URL parser" is all about blob: URLs.

TWUS: says a parser must accept one to an infinite amount of slashes

I think this is really not a big deal. It could just as well be 5 max, but 5 is arbitrary and less theoretically pleasing than http://www.catb.org/jargon/html/Z/Zero-One-Infinity-Rule.html
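For contrast, a roughly RFC 3986-style parser (Python's `urlsplit` here, as a stand-in) does not collapse extra slashes: it reads an empty authority and pushes the remainder into the path, whereas the TWUS parser skips any run of slashes after a special scheme:

```python
from urllib.parse import urlsplit

parts = urlsplit("http:////example.com/")
print(repr(parts.netloc))  # '' -- empty authority, not example.com
print(repr(parts.path))    # '//example.com/'
```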

Real world: 32-bit numbers occur, and are automagically supported by typical OS-level name resolver functions

When I looked into it, it seemed hard to opt out of supporting it in such functions. (The most a program could do is recognize such "exotic" IPv4 syntax and reject it with a parse error, if it doesn’t want to resolve the IP address.)
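The "32-bit number" form is just the host's address rendered as one integer; converting it back to dotted-quad is a two-liner (the function name here is illustrative, not from any spec or implementation):

```python
def ipv4_from_u32(n: int) -> str:
    """Render a 32-bit integer host (e.g. http://2130706433/) as a dotted quad."""
    return ".".join(str((n >> shift) & 0xFF) for shift in (24, 16, 8, 0))

print(ipv4_from_u32(2130706433))  # 127.0.0.1
```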

TWUS: Doesn't specify IDNA 2003 nor 2008, but somehow that's still clear

It specified Unicode TR46, which fully defines algorithms independently of IDNA 2003 or 2008. (Though it is based on the Punycode RFC.)

Real world: at least curl and wget2 ignore "rubbish" entered after the number all the way to the next component divider

Personal opinion: it sounds problematic to silently ignore part of the input?

A TWUS URL thus needs other magic to know where a URL ends.

For example in <a href="…">, HTML syntax defines exactly where the value of the href attribute ends, so there is no need for magic.

If URLs need to be found in the middle of a free-form paragraph of text without any markup, there’s a lot more magic (and heuristics) required than splitting on spaces. I think defining this does not belong in a URL spec.

TWUS has a test suite (that only runs in JavaScript-enabled browsers).

Part (arguably the most important part) of this test suite has its test cases in a JSON file that can be used without JavaScript (and is in rust-url).
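The entries in that JSON file are plain objects, so a harness in any language can iterate them without a browser. This is a simplified sketch of the shape (the real file, urltestdata.json, has more fields per case; the sample entries here are illustrative, not copied from it):

```python
import json

# Simplified shape: each case has an "input", an optional "base",
# and either the expected serialization ("href") or "failure": true.
sample = json.loads("""
[
  {"input": "http://example.org/", "base": null, "href": "http://example.org/"},
  {"input": "http://f:b/c", "base": null, "failure": true}
]
""")

for case in sample:
    if case.get("failure"):
        print("expect parse error:", case["input"])
    else:
        print("expect", case["input"], "->", case["href"])
```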

SimonSapin avatar Feb 08 '17 16:02 SimonSapin

IIRC the TWUS parser only accepts input without a scheme when there’s a base URL

Right, clearly wrong of me. That's virtually the same as 86. I've removed that mistake.

In this specific instance, I think "Non-web-browser" means anything that doesn’t also implement ...

I suppose that's true too. It got removed as well when I cleaned up the scheme flaws.

When I looked into it, it seemed hard to choose to not support it in such functions.

Yes, as long as you mean 32-bit numbers and you use the stock name resolver functions. The trickier part is the dotted numerical version that isn't four numerical fields. But still, that's not part of 86.

Personal opinion: it sounds problematic to silently ignore part of the input?

Both yes and no. When it comes to curl, the original approach was to only interfere where it had to and pass everything else through as far as it could. So you could send in illegal things in the URL and they would be used in the end anyway, and that could help users torture their servers with crap other clients wouldn't send.

Over time that has turned out to be harder and a bit error-prone, so we've had to make the parser stricter, but it still has a fairly lenient approach and the focus is that if you pass it a legal URL it should parse it and work with it. Illegal URLs are not always rejected (sort of garbage in, garbage out), but I think we're slowly rejecting more and more of them.

For example in <a href="…">, HTML syntax defines exactly where the value of the href attribute ends, so there is no need for magic.

Right, but when you accept white space as part of a URL, you need something else, such as another character, to mark where it ends. In HTTP headers that other character is typically a newline. If it is within an <a> tag, I suppose the HTML parser would pass on the length.

I should avoid the word "magic" there and instead say another method or another character.
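One common way to sidestep the delimiter problem entirely is to never put a literal space in the URL and percent-encode it instead; then any delimiter convention works. A quick illustration with Python's `urllib.parse` (this is just the standard escaping mechanism, not a recommendation from either spec):

```python
from urllib.parse import quote, unquote

# Percent-encode the space so the URL contains no literal white space.
encoded = quote("/path with space")
print(encoded)           # /path%20with%20space
print(unquote(encoded))  # /path with space
```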

Part (arguably the most important part) of this test suite has its test cases in a JSON file

I drowned in all the other things there when I looked previously, but I agree that it looks fine. I've pushed a change now that links directly to the source JSON file.

Thanks for all the feedback. I've done several commits now to clean up.

bagder avatar Feb 10 '17 14:02 bagder

It specified Unicode TR46, which fully defines algorithms independently of IDNA 2003 or 2008. (Though it is based on the Punycode RFC.)

I'm useless when it comes to anything non-ascii so I suppose that's why I'm extra confused by all these IDNA things.

Are you saying that the TR46 document makes it clear to you how to encode IDN host names when doing name resolves and then works with everything, including German ß's?

bagder avatar Feb 10 '17 14:02 bagder

Are you saying that the TR46 document makes it clear to you how to encode IDN host names when doing name resolves

Yes.

and then works with everything, including German ß's?

If you mean "Is implementing that spec sufficient for achieving interoperability with every domain, TLD, and registrar in the world", I don’t know. I assume Anne chose TR46 over alternatives because he thought it would provide better, if not perfect, interop.

I just googled for what he wrote about this and found:

https://annevankesteren.nl/2014/06/url-unicode

The reasoning is that it provides an interface compatible with IDNA2003, including almost identical processing, but is based on the IDNA2008 dataset.
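The ß case is the classic illustration of that split. Python's stdlib "idna" codec implements IDNA2003-style processing, which maps ß to "ss" (the same result as TR46 transitional processing); IDNA2008 and TR46 nontransitional processing would instead keep ß and Punycode-encode the label, which needs a third-party library rather than the stdlib codec:

```python
# Stdlib "idna" codec: IDNA2003 nameprep maps ß (eszett) to "ss",
# so the label comes out as plain ASCII with no xn-- prefix.
print("straße.de".encode("idna"))  # b'strasse.de'

# Under IDNA2008 / TR46 nontransitional processing the label keeps ß
# and is Punycode-encoded instead (xn--strae-oqa.de); the stdlib
# codec cannot produce that form.
```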

SimonSapin avatar Feb 15 '17 00:02 SimonSapin

If you mean "Is implementing that spec sufficient for achieving interoperability with every domain, TLD, and registrar in the world", I don’t know

Then I would say that it isn't that clear to you either. A clear spec would specify the single algorithm that should be used. (And "should" doesn't mean that everyone adheres to that spec, just like any other spec in the world.)

bagder avatar Mar 06 '17 19:03 bagder

You’re either misunderstanding or misrepresenting what I wrote. TWUS does specify a single algorithm that, in the opinion of its editors, should be used.

My “I don’t know” was a response to your “works with everything”, in the sense that “everything” is an unbounded set of things and so that question can never be answered. No single person knows all the corner cases of every piece of software that exists in the world.

However if and when we do find out that some aspect of TWUS doesn’t work with something, we can try and tweak TWUS to fix that problem.

SimonSapin avatar Mar 06 '17 23:03 SimonSapin