IDNA Utils

Open indolering opened this issue 8 years ago • 11 comments

Ticket tracking discussion of restoring the URL.domainToASCII and URL.domainToUnicode functions or implementing something new.

Summary

Processing international domain name labels is tricky, slow, and requires large lookup tables. However, browsers already perform this task (typically using the ICU library) and could expose these functions to JavaScript.

The proposal to add this functionality was nixed because no major browser had implemented it. Node supports the call (in under 50 lines of code), and a WebKit developer chimed in to say it would be trivial to add.
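For reference, here is a minimal sketch of what the Node side looks like today, assuming a Node build with ICU support (the default for official releases). The functions are exported from Node's `url` module rather than hanging off the `URL` global; the variable names are illustrative only.

```javascript
const { domainToASCII, domainToUnicode } = require('node:url');

// Unicode -> Punycode (ASCII) form, as stored in DNS.
const ascii = domainToASCII('español.com');

// Punycode -> Unicode form, suitable for display.
const unicode = domainToUnicode('xn--espaol-zwa.com');

// Per the Node docs, an invalid domain yields an empty string
// instead of throwing.
const invalid = domainToASCII('xn--iñvalid.com');

console.log(ascii);   // xn--espaol-zwa.com
console.log(unicode); // español.com
console.log(JSON.stringify(invalid));
```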

One issue is which version of the ToUnicode function should be exposed, and whether other utility functions might be needed, such as subdomain comparisons, distinguishing between domains/subdomains/TLD/public suffix, and IP address parsing.

indolering avatar Mar 15 '17 04:03 indolering

My two cents: I'm worried that attaching additional parsing to this feature will result in an implementation hazard.

I was thinking about how to implement domain parsing yesterday and it would be convenient if the URL object contained that information. However, I would assume that the frequent updates to the Public Suffix List would make it hard to maintain compatibility between browsers and versions.

The size and complexity of an IP parsing library is nowhere near that of Stringprep, Nameprep, and Punycode. But if it's easy to do, IP parsing is a reasonable request to make of the standard library for a web-centric programming language.
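To give a rough sense of scale, a strict dotted-decimal IPv4 parser fits in a dozen lines with no lookup tables. This is only a sketch: `parseIPv4` is a hypothetical helper name, and it deliberately ignores the hex, octal, and shortened forms that the URL Standard's IPv4 parser also accepts.

```javascript
// Hypothetical helper: parse strict dotted-decimal IPv4 ("a.b.c.d") into
// an unsigned 32-bit number. Returns null for anything else.
function parseIPv4(input) {
  const parts = input.split('.');
  if (parts.length !== 4) return null;

  let value = 0;
  for (const part of parts) {
    // One to three decimal digits, no leading zeros.
    if (!/^(0|[1-9]\d{0,2})$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    // Multiply instead of shifting to stay clear of signed 32-bit overflow.
    value = value * 256 + octet;
  }
  return value;
}

console.log(parseIPv4('127.0.0.1'));   // 2130706433
console.log(parseIPv4('256.0.0.1'));   // null
console.log(parseIPv4('example.com')); // null
```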

WRT which version of ToUnicode ... make it configurable :woman_shrugging:?

indolering avatar Mar 15 '17 05:03 indolering

@mikewest any new thoughts on all this?

I'm mainly asking the other questions since I wonder whether we should introduce a URLHost object rather than a couple of one-off utility methods.

(I don't think we want ToUnicode to be configurable. Each extra bit of API surface just leads to lots of bugs. Better to start out small.)

annevk avatar Mar 15 '17 08:03 annevk

A host parser could also be added to the URLHost utility collection; see https://github.com/whatwg/url/pull/218#issuecomment-276665579 and https://github.com/whatwg/url/pull/218#issuecomment-276698315

rmisev avatar Mar 15 '17 10:03 rmisev

const host = new URLHost(rawInput)
host.toString() // probably ASCII, as per usual
host.unicode() // ToUnicode?
host.type // "ipv4", "ipv6", "domain"

Alternatively you could make ToUnicode an argument to toString() somehow, similar to https://tc39.github.io/ecma262/#sec-number.prototype.tostring. Not sure if that's a precedent to follow however.
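To make the shape above concrete, here is a rough userland approximation that runs in Node today. It is a sketch under stated assumptions, not a proposed implementation: `URLHost` does not exist in any standard, and this version leans on the existing `URL` parser, `url.domainToUnicode`, and `net.isIP` to stand in for whatever a real host parser would do.

```javascript
const { domainToUnicode } = require('node:url');
const net = require('node:net');

// Hypothetical class mirroring the API shape sketched in this thread.
class URLHost {
  constructor(rawInput) {
    // Reject input that is clearly more than a bare host. A port such as
    // "example.com:8080" still slips through and is silently dropped.
    if (/[\s/\\?#@]/.test(rawInput)) {
      throw new TypeError('Expected a bare host');
    }
    // Borrow the URL parser's host parsing; it throws TypeError on
    // invalid hosts and applies ToASCII to domains.
    this.ascii = new URL(`http://${rawInput}`).hostname;
  }

  // "ipv4", "ipv6", or "domain".
  get type() {
    if (this.ascii.startsWith('[')) return 'ipv6'; // serialized as [addr]
    if (net.isIP(this.ascii) === 4) return 'ipv4';
    return 'domain';
  }

  // ASCII serialization, as per usual.
  toString() {
    return this.ascii;
  }

  // Unicode serialization; IP addresses are returned unchanged.
  unicode() {
    return this.type === 'domain' ? domainToUnicode(this.ascii) : this.ascii;
  }
}

const host = new URLHost('español.com');
console.log(host.toString()); // xn--espaol-zwa.com
console.log(host.unicode());  // español.com
console.log(host.type);       // domain
```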

annevk avatar Mar 15 '17 10:03 annevk

> I don't think we want ToUnicode to be configurable. Each extra bit of API surface just leads to lots of bugs. Better to start out small.

Well, which "version" of ToUnicode do we want, the standard ICU implementation or what the browser URL bar displays?

> Alternatively you could make ToUnicode an argument to toString()

I just don't think that overloading host.toString() is appropriate because the Punycode/Nameprep transform is very specific to DNS.

indolering avatar Mar 15 '17 19:03 indolering

Thinking this over, I think it should output the result of the standard ToUnicode function, as that's easier to standardize across environments (e.g. Node.js).

indolering avatar Mar 15 '17 22:03 indolering

I do think something like this would be useful, and Node's implementation seems like a reasonable justification for paving the cowpath. If WebKit and Mozilla are also interested, I think Blink would follow suit.

That said, @sleevi had some concerns in https://github.com/whatwg/url/issues/63#issuecomment-286851402. CCing him here.

mikewest avatar Mar 16 '17 05:03 mikewest

@indolering note that there's no such thing as "standard" ToUnicode. I think we should be using https://url.spec.whatwg.org/#concept-domain-to-unicode which we already use in various places throughout the platform. I don't think we should expose variants, which I think was @sleevi's concern in that other thread. (Also note that our host parser is very specific to DNS already, since it already involves Punycode/Nameprep due to ToASCII which is applied on input.)
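The point about ToASCII being applied on input can be observed with the existing `URL` API. The sketch below runs in Node; using Node's `url.domainToUnicode` for the reverse direction is an assumption made for illustration, since no equivalent is exposed to web content.

```javascript
const { domainToUnicode } = require('node:url');

// The host parser runs ToASCII on input, so the stored host is already
// lowercased Punycode no matter how the URL was written.
const parsed = new URL('https://Español.COM/path');
const asciiHost = parsed.hostname;

// Going back to a display form needs a separate ToUnicode step.
const displayHost = domainToUnicode(asciiHost);

console.log(asciiHost);   // xn--espaol-zwa.com
console.log(displayHost); // español.com
```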

annevk avatar Mar 16 '17 08:03 annevk

> note that there's no such thing as "standard" ToUnicode.

I'll take your word for it! It's my preference for a single implementation to be shared across browsers and Node. AFAIK, this isn't the case when it comes to what's displayed in the URL bar. But IDNA makes me go cross-eyed, so I'll stop inserting myself.

indolering avatar Mar 16 '17 18:03 indolering

I created a PR for this since we've got interest now from WebKit and Chrome. I'm a little worried about all the incompatibilities we still have with IDNA, but those are also exposed in other ways already.

@achristensen07 I'd appreciate review of #288 from you since you said WebKit would be interested in something like this.

What should be done before landing:

  • Add examples. If anyone here is willing to contribute some, that'd be great!
  • Write web-platform-tests. Again, help appreciated. If anyone needs guidance, I'm happy to help.

annevk avatar Mar 31 '17 13:03 annevk

Yeah, I do want to echo the concerns, and I'll loop @mikewest onto some design docs he may not have been aware of when he expressed support :)

sleevi avatar Mar 31 '17 13:03 sleevi