Punycode decoding is not working in rare cases
Version
16.5.0
Platform
various linux distros tried
Subsystem
No response
What steps will reproduce the bug?
On some IDNs, a correctly Punycode-encoded domain breaks url.domainToUnicode() and, with it, other parts of Node.js such as dns.lookup().
The domain in question is xn--mgba6a0dd.com, which works on its own. However, if there are subdomains with the combination . + _ (as in foo._bar below), the API returns an empty string. As far as I know that should not be the case, and https://www.punycoder.com/ handles all of these correctly.
To reproduce via the node repl:
> require('url').domainToUnicode('foo._bar.xn--mgba6a0dd.com')
''
> require('url').domainToUnicode('foo.bar.xn--mgba6a0dd.com')
'foo.bar.قازاق.com'
> require('url').domainToUnicode('foo_bar.xn--mgba6a0dd.com')
'foo_bar.قازاق.com'
How often does it reproduce? Is there a required condition?
No response
What is the expected behavior?
No response
What do you see instead?
No response
Additional information
No response
@nodejs/dns
The behavior described here appears consistent with UTS #46’s ToASCII with UseStd3ASCIIRules set to true. It is “strongly recommended” to do so there, but the reason the parameter can be false in the first place is to leave room for cases where the range of acceptable labels is broader than it “should” be due to convention or web-reality exceptions (see STD3 Rules).
The use of “underscore labels” for SRV records is called out as an example of a convention of permitting typically invalid labels in the last paragraph of RFC 5890 § 2.3.2.3. Those are “non-LDH” labels in the parlance of that RFC (and therefore not internationalizable, though it’s talking about the individual label there, not the overall domain). Searching for “NON-LDH label” there should get you to an ASCII diagram that can be helpful for getting a quick sense of the terms they’re using to classify labels & their subset/superset relations.
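To make those RFC 5890 terms a bit more concrete, here is a rough sketch of the label classes from that diagram. classifyLabel is a hypothetical helper, not anything Node exposes, and it does not try to tell real A-labels apart from “fake” ones, since that would require validating the Punycode payload.

```javascript
// Rough classification of a single DNS label following RFC 5890 § 2.3.1.
// classifyLabel is a hypothetical helper written for illustration only.
function classifyLabel(label) {
  // LDH label: ASCII letters, digits and hyphens only, no hyphen at the
  // start or end, and at most 63 octets long.
  const isLdh =
    label.length >= 1 &&
    label.length <= 63 &&
    /^[A-Za-z0-9-]+$/.test(label) &&
    !label.startsWith('-') &&
    !label.endsWith('-');
  if (!isLdh) return 'NON-LDH';

  // Reserved LDH labels have "--" in the third and fourth positions.
  if (label.slice(2, 4) !== '--') return 'NR-LDH';

  // XN-labels are the reserved labels with the "xn--" prefix (any case).
  return /^xn--/i.test(label) ? 'XN-label' : 'R-LDH';
}

console.log(classifyLabel('_bar'));          // NON-LDH
console.log(classifyLabel('foo'));           // NR-LDH
console.log(classifyLabel('xn--mgba6a0dd')); // XN-label
```

By this classification the _bar label from the report is NON-LDH purely because of the underscore, independent of the IDN label that follows it.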
If I understand that right, I think a case can be made that UseStd3ASCIIRules=false should be used for Node’s DNS utils. I think one is meant to substitute an alternative check (like “UseStd3ASCIIRules, but modified to permit underscore as the first character” or something) when doing so, rather than actually saying “everything is possible! what could go wrong”. You could probably make a counter-case that that isn’t worthwhile and that supporting non-LDH labels is a can of worms that shouldn’t be facilitated. Not sure myself, but in any case, because ToASCII operates on domains, not individual labels, non-LDH labels are errors from its POV unless the UseStd3ASCIIRules=false customization escape hatch is used.
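In the meantime, one possible userland workaround for the decoding side is to go label by label and leave anything that is not an xn-- label untouched. This is only a sketch, not how Node implements url.domainToUnicode(). lenientDomainToUnicode is a hypothetical name, and the approach assumes url.domainToUnicode() decodes a lone xn-- label when it is passed on its own.

```javascript
const { domainToUnicode } = require('url');

// Hypothetical helper: decode each xn-- label separately so that a
// non-LDH sibling label such as "_bar" cannot make the whole call fail.
function lenientDomainToUnicode(domain) {
  return domain
    .split('.')
    .map((label) => {
      // Labels without the ACE prefix are passed through unchanged.
      if (!label.toLowerCase().startsWith('xn--')) return label;
      const decoded = domainToUnicode(label);
      // domainToUnicode() returns '' for input it rejects; in that case
      // keep the original label instead of dropping it.
      return decoded === '' ? label : decoded;
    })
    .join('.');
}

console.log(lenientDomainToUnicode('foo._bar.xn--mgba6a0dd.com'));
```

Note that this skips every validity check on the non-xn-- labels, so it is closer to the “everything is possible” end of the spectrum than to a properly modified STD3 check.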