More IDNA roundtrippability issues
Here are a few more issues (from @valenting in https://github.com/whatwg/url/issues/603#issuecomment-1462034815). We need to sort out whose fault this is: the spec's or the whatwg-url implementation's. (I've also included a few additional examples to defeat the ASCII-only fast path in Chrome.)
| input | whatwg-url | Chrome | WebKit | Live URL Viewer |
|---|---|---|---|---|
| http://a.xn--xn-----/ | http://a.xn----/ | http://a.xn--xn-----/ | error | link |
| http://é.xn--xn-----/ | http://xn--9ca.xn----/ | error | error | link |
| http://a.xn----/ | http://a.-/ | http://a.xn----/ | error | link |
| http://é.xn----/ | http://xn--9ca.-/ | error | error | link |
| http://a.xn--/ | http://a./ | http://a.xn----/ | error | link |
| http://é.xn--/ | http://xn--9ca./ | error | error | link |
Without digging too deep, it seems that Punycode-decoding any of these labels results in an all-ASCII label, which should never have been Punycode-encoded in the first place. However, RFC 3492 says the following:
> Using hyphen-minus as the delimiter implies that the encoded string can end with a hyphen-minus only if the Unicode string consists entirely of basic code points, but IDNA forbids such strings from being encoded.
I'm not yet sure where in IDNA this requirement is set, but if Unicode IDNA included this requirement then that'd probably solve this issue.
Update: Indeed, IDNA2003's ToUnicode (https://www.rfc-editor.org/rfc/rfc3490#section-4.2) includes:
> 3. Verify that the sequence begins with the ACE prefix, and save a copy of the sequence.
> 4. Remove the ACE prefix.
> 5. Decode the sequence using the decoding algorithm in [PUNYCODE] and fail if there is an error. Save a copy of the result of this step.
> 6. Apply ToASCII.
> 7. Verify that the result of step 6 matches the saved copy from step 3, using a case-insensitive ASCII comparison.
Basically, it includes a roundtrippability test as part of the ToUnicode algorithm. This test is absent from UTS 46's ToUnicode and processing steps.
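The roundtrip test can be sketched roughly as follows. This is a simplified illustration, not the RFC 3490 algorithm in full: the hypothetical `to_ascii_label` helper skips Nameprep and every other ToASCII step and only Punycode-encodes labels that contain non-ASCII characters, and `roundtrips` is likewise an assumed name.

```python
ACE_PREFIX = "xn--"

def to_ascii_label(label: str) -> str:
    """Heavily simplified ToASCII: all-ASCII labels are returned unchanged."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def roundtrips(a_label: str) -> bool:
    """Decode an ACE label, re-encode it, and compare case-insensitively."""
    if not a_label.lower().startswith(ACE_PREFIX):
        raise ValueError(f"not an ACE label: {a_label!r}")
    try:
        decoded = a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    except UnicodeError:
        return False  # Punycode decoding failed, so the label is rejected
    return to_ascii_label(decoded).lower() == a_label.lower()

for label in ["xn--9ca", "xn--xn-----", "xn----", "xn--"]:
    print(label, roundtrips(label))
```

Under these assumptions only `xn--9ca` survives the comparison; the other three decode to all-ASCII strings that the simplified ToASCII returns unchanged, so they no longer match the input.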
Update 2: IDNA2008's Domain Name Lookup Protocol (https://www.rfc-editor.org/rfc/rfc5891.html#section-5) has the same roundtrippability test. Section 5.3 has:
> If the input to this procedure appears to be an A-label (i.e., it starts in "xn--", interpreted case-insensitively), the lookup application MAY attempt to convert it to a U-label … If the label is converted to Unicode (i.e., to U-label form) using the Punycode decoding algorithm, then the processing specified in [the following] two sections MUST be performed, and the label MUST be rejected if the resulting label is not identical to the original.
The following two sections would basically validate the U-label, and then convert the U-label back into an A-label using Punycode. So this test is essentially equivalent to the IDNA2003 version.
AFAIK WebKit uses the ICU library for IDNA. ICU added a check that reports failure on xn-- and xn--ASCII- labels, placed after the "If the label starts with “xn--”" step. This check hasn't been added to the UTS 46 standard yet. More info here:
- https://github.com/unicode-org/icu/pull/1234
- https://unicode-org.atlassian.net/browse/ICU-21030
This explains why these tests return an error in WebKit but succeed in whatwg-url.
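For illustration, a purely syntactic check in that spirit might look like the sketch below. It is not ICU's actual code: the function name is hypothetical, and the rule (reject when the part after the prefix is empty or ends with a hyphen) is an assumption derived from the RFC 3492 observation that an encoded string can end with a hyphen-minus only when the decoded string is all ASCII. ICU's handling of edge cases such as xn--- may differ.

```python
ACE_PREFIX = "xn--"

def is_suspect_ace_label(label: str) -> bool:
    """Flag "xn--" and "xn--ASCII-" style labels without decoding them."""
    lowered = label.lower()
    if not lowered.startswith(ACE_PREFIX):
        return False  # not an ACE label, so this check does not apply
    rest = lowered[len(ACE_PREFIX):]
    # Empty payload, or a trailing delimiter with no encoded non-ASCII part.
    return rest == "" or rest.endswith("-")

for label in ["xn--xn-----", "xn----", "xn--", "xn--9ca", "a"]:
    print(label, is_suspect_ace_label(label))
```

This flags the three problematic labels from the table and lets `xn--9ca` and plain `a` through, without needing a full decode and re-encode.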
Interesting, per comments on the second issue @markusicu already submitted feedback for this, but it apparently hasn't been processed yet? @macchiati do you happen to know if that feedback is still pending or did it get lost?