Some URL-interop feedback
It seems to omit discussion of query, which has the most historical baggage due to encodings coming into play.
The infinite number of slashes is only a thing for special URLs. So it doesn't affect all URLs.
RFC 3986 doesn't say it's the first @. It simply disallows multiple @, and its only error handling is that you fail parsing altogether if you find more than one.
The URL Standard also only considers RFC 3986 IPv4 addresses as valid, but it parses more variants.
IDNA: we already discussed this. The UTS 46 ToASCII algorithm defines how they work (though there are some issues as you can find in the URL repository).
The URL Standard actually restricts ports to a 16-bit integer. That's different from RFC 3986, which has no such restriction.
Apart from space there's a number of ASCII code points that are parsed in the URL Standard whereas RFC 3986 would reject them. That's true for most components I think.
E.g., you say there's no issues with fragments, but currently Firefox encodes spaces there, but other browsers do not. Or did you mean to limit this to issues that affect network protocols?
Thanks for the good input @annevk !
It seems to omit discussion of query, which has the most historical baggage due to encodings coming into play.
So the query part is different than the path in that aspect? It omits that part because I've never experienced any such problems and I'm unaware of those details. More problems than "just" non-ascii parts?
The infinite number of slashes is only a thing for special URLs. So it doesn't affect all URLs.
Really? It may just show how hard a time I have reading TWUS. Which URLs does it affect and which doesn't it affect?
RFC 3986 doesn't say it's the first @. It simply disallows multiple @
To me, it pretty clearly specifies @ as a separator which thus implies that the first occurrence ends the 'userinfo' field. Section 3.2 and 3.2.1. A '@' as part of the userinfo has to be URL encoded as %40.
IDNA: we already discussed this.
Yes, but I'm not clever enough to understand, so to me there's still a problem.
The URL Standard actually restricts ports to a 16-bit integer. That's different from RFC 3986, which has no such restriction.
Ah yes, good point. I've added a mention about that now.
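To make the port difference concrete, here is a minimal Python sketch. The function names are made up for illustration, and it only looks at non-empty port strings; it is not taken from either specification's text.

```python
def port_ok_rfc3986(port_text: str) -> bool:
    """RFC 3986 grammar is port = *DIGIT: any run of ASCII digits, no upper bound."""
    return port_text.isascii() and port_text.isdigit()


def port_ok_url_standard(port_text: str) -> bool:
    """Same digits-only rule, but the value must also fit in 16 bits (0..65535)."""
    return port_ok_rfc3986(port_text) and int(port_text) <= 0xFFFF


for candidate in ("80", "65535", "65536", "8080808080"):
    print(candidate, port_ok_rfc3986(candidate), port_ok_url_standard(candidate))
```

Anything above 65535 passes the first check but fails the second.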
Apart from space there's a number of ASCII code points that are parsed in the URL Standard whereas RFC 3986 would reject them. That's true for most components I think.
Hm, that's valuable input. I should probably run some tests and figure some of that out to make that section more complete. Do you have any more detailed guesses/clues of ASCII codes that this could be? The spaces are what I've seen happen in the real world and none of the others have bitten me (i.e. no users have reported problems with other code points).
you say there's no issues with fragments, but currently Firefox encodes spaces there, but other browsers do not. Or did you mean to limit this to issues that affect network protocols?
Not sure. I wasn't aware of any issues so I didn't list any. So other browsers encode other incoming spaces in URLs but leave the spaces if they're part of the fragment? How inconsistent. I suppose in my case I've not experienced this issue very much since I work with URLs much more outside of browsers, where spaces normally are not part of URLs at all. Plus, fragments are more of a browser thing and not used that much by non-browsers.
So the query part is different than the path in that aspect? It omits that part because I've never experienced any such problems and I'm unaware of those details. More problems than "just" non-ascii parts?
Yes, the encoding of the document is sometimes used for the percent-encoding, rather than just UTF-8. This doesn't happen frequently and would probably mostly affect legacy non-Western content, although legacy Western content can certainly be affected too.
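A small Python sketch of what that means in practice: the same query text percent-encodes to different bytes depending on which encoding is applied first. The helper name and the choice of windows-1252 as the "document encoding" are assumptions for illustration only.

```python
from urllib.parse import quote


def encode_query_value(text: str, document_encoding: str = "utf-8") -> str:
    """Percent-encode text after converting it to bytes in the given encoding."""
    return quote(text, safe="", encoding=document_encoding)


print(encode_query_value("é"))                  # %C3%A9
print(encode_query_value("é", "windows-1252"))  # %E9
```

A server that expects UTF-8 will not decode the second form back to the original character.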
Which URLs does it affect and which doesn't it affect?
Only URLs with a special scheme.
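As a rough sketch, the special schemes in the URL Standard are ftp, file, http, https, ws and wss, so a check could look like the Python below. The helper name is made up, and it ignores details such as leading whitespace that a real parser handles.

```python
SPECIAL_SCHEMES = {"ftp", "file", "http", "https", "ws", "wss"}


def has_special_scheme(url: str) -> bool:
    """Return True if the text before the first ':' is a special scheme."""
    scheme, separator, _rest = url.partition(":")
    return separator == ":" and scheme.lower() in SPECIAL_SCHEMES


print(has_special_scheme("HTTP:////example.com/"))  # True
print(has_special_scheme("foo:////example.com/"))   # False
```

Only URLs for which this returns True get the extra-slashes treatment.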
To me, it pretty clearly specifies @ as a separator which thus implies that the first occurrence ends the 'userinfo' field.
But it also disallows @ in userinfo so you can only have one. So if you are going to do error handling, whatever you do you cannot claim it is defined in some way.
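To show why the choice of error handling matters, here is a minimal Python sketch of two possible recoveries for an authority with more than one @. Both function names are hypothetical, and neither is claimed to be what RFC 3986 defines, since that input is simply invalid there.

```python
def split_at_first(authority: str) -> tuple[str, str]:
    """Treat the first '@' as the end of userinfo."""
    userinfo, separator, host = authority.partition("@")
    return (userinfo, host) if separator else ("", authority)


def split_at_last(authority: str) -> tuple[str, str]:
    """Treat the last '@' as the end of userinfo."""
    userinfo, _separator, host = authority.rpartition("@")
    return (userinfo, host)


authority = "user@evil.example@good.example"
print(split_at_first(authority))  # ('user', 'evil.example@good.example')
print(split_at_last(authority))   # ('user@evil.example', 'good.example')
```

Two parsers that pick different strategies end up with different hosts for the same input.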
Do you have any more detailed guesses/clues of ASCII codes that this could be?
Backtick, various braces, etc. Whether they get encoded or not can make/break various things.
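A quick way to find such code points is to compare against the characters the RFC 3986 grammar can produce (unreserved, gen-delims, sub-delims, plus % for percent-encoding). A minimal Python sketch, with a made-up helper name, that does not check whether each character is legal in its particular component:

```python
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
RFC3986_CHARS = UNRESERVED | GEN_DELIMS | SUB_DELIMS | {"%"}


def chars_outside_rfc3986(url: str) -> list[str]:
    """Return the distinct characters in url that RFC 3986 never allows unencoded."""
    return sorted({char for char in url if char not in RFC3986_CHARS})


print(chars_outside_rfc3986("http://example.com/a`b{c}?q=x y"))
# [' ', '`', '{', '}']
```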