Punycode decoding is not working in rare cases

Open nebulade opened this issue 2 years ago • 2 comments

Version

16.5.0

Platform

various linux distros tried

Subsystem

No response

What steps will reproduce the bug?

For some IDNs, a correctly Punycode-encoded domain breaks url.domainToUnicode(), and with it other parts of Node.js such as dns.lookup().

The domain in question is xn--mgba6a0dd.com, which works on its own. However, if a subdomain contains the combination "._" (a label that starts with an underscore), the API returns an empty string. As far as I know that should not be the case, and https://www.punycoder.com/ handles all of these correctly.

To reproduce via the node repl:

> require('url').domainToUnicode('foo._bar.xn--mgba6a0dd.com')
''
> require('url').domainToUnicode('foo.bar.xn--mgba6a0dd.com')
'foo.bar.قازاق.com'
> require('url').domainToUnicode('foo_bar.xn--mgba6a0dd.com')
'foo_bar.قازاق.com'
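The same three cases can also be run as a script. This is only a minimal sketch: checkHosts is a hypothetical helper name, and the results may differ from the REPL output above on Node versions other than 16.5.0.

```javascript
const { domainToUnicode } = require('url');

// Hostnames taken from the REPL session above.
const hosts = [
  'foo._bar.xn--mgba6a0dd.com',
  'foo.bar.xn--mgba6a0dd.com',
  'foo_bar.xn--mgba6a0dd.com',
];

// checkHosts is a hypothetical helper: it decodes each hostname and
// flags the ones for which domainToUnicode() returned an empty string.
function checkHosts(list) {
  return list.map((host) => {
    const decoded = domainToUnicode(host);
    return { host, decoded, failed: decoded === '' };
  });
}

for (const r of checkHosts(hosts)) {
  console.log(`${r.failed ? 'FAIL' : 'ok  '} ${r.host} -> ${JSON.stringify(r.decoded)}`);
}
```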

How often does it reproduce? Is there a required condition?

No response

What is the expected behavior?

No response

What do you see instead?

No response

Additional information

No response

nebulade avatar Jul 15 '21 12:07 nebulade

@nodejs/dns

gireeshpunathil avatar Jul 31 '21 11:07 gireeshpunathil

The behavior described here appears consistent with UTS #46’s ToASCII with UseStd3ASCIIRules set to true. It is “strongly recommended” to do so there, but the reason the parameter can be false in the first place is to leave room for cases where the range of acceptable labels is broader than it “should” be due to convention or web-reality exceptions (see STD3 Rules).

The use of “underscore labels” for SRV records is called out as an example of a convention of permitting typically invalid labels in the last paragraph of RFC 5890 § 2.3.2.3. Those are “non-LDH” labels in the parlance of that RFC (and therefore not internationalizable, though it’s talking about the individual label there, not the overall domain). Searching for “NON-LDH label” there should get you to an ASCII diagram that can be helpful for getting a quick sense of the terms they’re using to classify labels & their subset/superset relations.

If I understand that right, I think a case can be made that UseStd3ASCIIRules=false should be used for Node’s DNS utils. When doing so, I think one is meant to substitute an alternative check (like “UseStd3ASCIIRules, but modified to permit underscore as the first character” or something) rather than actually saying “everything is possible! what could go wrong”. You could probably make a counter-case that this isn’t worthwhile and that supporting non-LDH labels is a can of worms that shouldn’t be facilitated. Not sure myself, but in any case, because ToASCII operates on domains, not individual labels, non-LDH labels are errors from its POV unless the UseStd3ASCIIRules=false customization escape hatch is used.
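To make the “modified to permit underscore as the first character” idea concrete, here is a rough sketch of what such a substitute check could look like when applied to the ASCII form of a domain. Everything in it is an assumption for illustration: isAcceptableLabel and isAcceptableDomain are hypothetical names, not Node or UTS #46 APIs, and it does not cover the other validity checks UTS #46 performs.

```javascript
// Hypothetical substitute for the STD3 check, applied to the ASCII
// (Punycode) form of a domain. Not part of Node's API.
function isAcceptableLabel(label) {
  // DNS labels are 1 to 63 octets long.
  if (label.length === 0 || label.length > 63) return false;

  // Permit a single leading underscore (SRV-style "underscore label"),
  // then require the remainder to follow the LDH rule.
  const rest = label.startsWith('_') ? label.slice(1) : label;

  // LDH: ASCII letters, digits and hyphens, with no leading or trailing hyphen.
  return /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/i.test(rest);
}

function isAcceptableDomain(domain) {
  return domain.split('.').every(isAcceptableLabel);
}

console.log(isAcceptableDomain('foo._bar.xn--mgba6a0dd.com'));
console.log(isAcceptableDomain('foo_bar.xn--mgba6a0dd.com'));
```

Note that this sketch is stricter than what the report shows in one respect: it rejects foo_bar (underscore in the middle of a label), which domainToUnicode() accepted in the REPL session above.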

bathos avatar Aug 10 '22 03:08 bathos