url More IDNA roundtrippability issues

Here are a few more issues (from @valenting in https://github.com/whatwg/url/issues/603#issuecomment-1462034815). We need to sort out whose fault this is: the spec or the whatwg-url implementation. (I've also included a few additional examples to defeat the ASCII-only fast path in Chrome.)

input	whatwg-url	Chrome	WebKit	Live URL Viewer
http://a.xn--xn-----/	http://a.xn----/	http://a.xn--xn-----/	error	link
http://é.xn--xn-----/	http://xn--9ca.xn----/	error	error	link
http://a.xn----/	http://a.-/	http://a.xn----/	error	link
http://é.xn----/	http://xn--9ca.-/	error	error	link
http://a.xn--/	http://a./	http://a.xn----/	error	link
http://é.xn--/	http://xn--9ca./	error	error	link

Without digging too deep, it seems that Punycode-decoding any of these labels results in an all-ASCII label that should never have been Punycode-encoded in the first place. However, RFC 3492 says the following:

Using hyphen-minus as the delimiter implies that the encoded string can end with a hyphen-minus only if the Unicode string consists entirely of basic code points, but IDNA forbids such strings from being encoded.

I'm not yet sure where in IDNA this requirement is set, but if Unicode IDNA included this requirement then that'd probably solve this issue.
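To illustrate the decoding claim, here is a minimal sketch that strips the ACE prefix from the labels in the table and Punycode-decodes the remainder. It assumes Python's built-in "punycode" codec, which implements RFC 3492 only and does no IDNA processing; `decode_ace_label` is a hypothetical helper name, not something taken from whatwg-url or any browser.

```python
ACE_PREFIX = "xn--"

def decode_ace_label(label: str) -> str:
    """Hypothetical helper: remove "xn--" and Punycode-decode the rest."""
    if not label.lower().startswith(ACE_PREFIX):
        raise ValueError("not an ACE label")
    payload = label[len(ACE_PREFIX):]
    # Python's stdlib "punycode" codec is plain RFC 3492, with no IDNA checks.
    return payload.encode("ascii").decode("punycode")

for label in ["xn--xn-----", "xn----", "xn--"]:
    decoded = decode_ace_label(label)
    print(repr(label), "->", repr(decoded), "all ASCII:", decoded.isascii())
```

Under these assumptions the three labels decode to `xn----`, `-`, and the empty string, which lines up with the whatwg-url column in the table.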

Update: Indeed, IDNA2003's ToUnicode (https://www.rfc-editor.org/rfc/rfc3490#section-4.2) includes:

3. Verify that the sequence begins with the ACE prefix, and save a copy of the sequence.
4. Remove the ACE prefix.
5. Decode the sequence using the decoding algorithm in [PUNYCODE] and fail if there is an error. Save a copy of the result of this step.
6. Apply ToASCII.
7. Verify that the result of step 6 matches the saved copy from step 3, using a case-insensitive ASCII comparison.

Basically, it includes a roundtrippability test as part of the ToUnicode algorithm. This test is absent from UTS 46's ToUnicode and processing steps.
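A rough sketch of that roundtrip test, to show why the labels from the table would be rejected. This is not a full IDNA2003 implementation: `to_ascii_simplified` is a hypothetical stand-in for ToASCII that skips Nameprep, STD3 rules, and length checks, and the sketch raises on mismatch to make the failure visible (the real ToUnicode returns the original input instead of failing).

```python
ACE_PREFIX = "xn--"

def to_ascii_simplified(label: str) -> str:
    """Simplified stand-in for ToASCII (assumption: no Nameprep, no STD3 checks)."""
    if label.isascii():
        return label  # all-ASCII labels are never Punycode-encoded
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def to_unicode_with_roundtrip(label: str) -> str:
    """Steps 3-7 of RFC 3490 ToUnicode, using the simplified ToASCII above."""
    if not label.lower().startswith(ACE_PREFIX):           # step 3
        return label
    saved = label
    payload = label[len(ACE_PREFIX):]                       # step 4
    decoded = payload.encode("ascii").decode("punycode")    # step 5
    reencoded = to_ascii_simplified(decoded)                # step 6
    if reencoded.lower() != saved.lower():                  # step 7
        raise ValueError(f"{label!r} does not round-trip")
    return decoded

for label in ["xn--9ca", "xn--xn-----", "xn----", "xn--"]:
    try:
        print(repr(label), "->", repr(to_unicode_with_roundtrip(label)))
    except ValueError as err:
        print("rejected:", err)
```

With this check in place, `xn--9ca` is accepted, while `xn--xn-----`, `xn----`, and `xn--` are all rejected because their decoded form is all-ASCII and ToASCII leaves it unencoded.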

Update 2: IDNA2008's Domain Name Lookup Protocol (https://www.rfc-editor.org/rfc/rfc5891.html#section-5) has the same roundtrippability test. Section 5.3 has:

If the input to this procedure appears to be an A-label (i.e., it starts in "xn--", interpreted case-insensitively), the lookup application MAY attempt to convert it to a U-label … If the label is converted to Unicode (i.e., to U-label form) using the Punycode decoding algorithm, then the processing specified in [the following] two sections MUST be performed, and the label MUST be rejected if the resulting label is not identical to the original.

The following two sections would basically validate the U-label, and then convert the U-label back into an A-label using Punycode. So this test is essentially equivalent to the IDNA2003 version.

Mar 09 '23 19:03 TimothyGu

AFAIK WebKit uses the ICU library for IDNA. ICU added a check that reports failure on xn-- and xn--ASCII- labels after the "If the label starts with “xn--”" step. This check hasn't been added to the UTS 46 standard yet. More info here:

https://github.com/unicode-org/icu/pull/1234
https://unicode-org.atlassian.net/browse/ICU-21030

This explains why these tests return an error in WebKit but succeed in whatwg-url.
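For illustration only, a check of this kind could look roughly like the sketch below. It is based on the description above and on the RFC 3492 delimiter rule, not on ICU's actual source; `has_degenerate_ace_payload` is a hypothetical name.

```python
ACE_PREFIX = "xn--"

def has_degenerate_ace_payload(label: str) -> bool:
    """Hypothetical check: True for "xn--" and "xn--ASCII-" style labels."""
    if not label.lower().startswith(ACE_PREFIX):
        return False
    payload = label[len(ACE_PREFIX):]
    # An empty payload, or one ending in the Punycode delimiter "-", means the
    # decoded label consists only of basic (ASCII) code points (RFC 3492).
    return payload == "" or payload.endswith("-")

for label in ["xn--xn-----", "xn----", "xn--", "xn--9ca"]:
    print(repr(label), "rejected" if has_degenerate_ace_payload(label) else "ok")
```

Unlike a full roundtrip test, this needs no decoding at all: it only inspects the text after the prefix.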

Mar 09 '23 20:03 rmisev

Interesting. Per the comments on the second issue, @markusicu already submitted feedback for this, but it apparently hasn't been processed yet? @macchiati, do you happen to know whether that feedback is still pending or got lost?

Mar 10 '23 14:03 annevk

Sorry, my fault. It's approved but I am behind on UTC action items.

[165-A48] Action Item for Markus Scherer, Editorial Committee: Update UTS #46 to validate ACE label edge cases, see L2/20-240 item F7. For Unicode 14.

There are a couple of others relevant for UTS46... :-/

Mar 10 '23 18:03 markusicu