The idn-* formats are problematic
I've been implementing the format assertions recently and I realized that the idn-* formats are problematic. The specification that's used for idn-hostname is known as IDNA2008. I've discovered that there are no JavaScript implementations. Normally, I wouldn't shy away from implementing it myself, but I've learned enough to know that any implementation would be prohibitively large (many times the size of my whole JSON Schema implementation). In most programming languages, including several hundred KB of Unicode tables wouldn't be an issue, but it is in JavaScript.
There's another specification called UTS #46 and that's what all the browsers implement. It's a slightly lighter weight version of IDNA2008. That does have JavaScript implementations, but even the best ones are still prohibitively large and/or aren't complete.
So, there isn't a reasonable path to supporting idn-hostname in JavaScript. That means there's no reasonable path for JavaScript implementations to support the format-assertion vocabulary, which states that "implementations MUST provide full validation support for all of the formats defined by this specification." So, implementations are required to reject schemas that use the format-assertion vocabulary even if the schema never uses the format that isn't supported.
The next release fixes that problem by changing the requirement to only reject the schema if it encounters a format it doesn't fully support, but there's still no reasonable path for JavaScript implementations to support the three IDNA2008-based formats. That raises the question: should we make the requirements for the idn-* formats less strict so that every implementation can support them?
I said there were three formats that were affected. We talked about idn-hostname already. idn-email is affected because it specifies that the domain portion of the address is valid according to IDNA2008. Surprisingly, iri isn't affected. It doesn't limit its hostnames to IDNA2008 compatible values. The third one that's affected is hostname. In draft-07, it was added that hostnames converted to ASCII form (called A-Labels in IDNA2008) were valid hostnames as well as traditional hostnames. So, in order to fully validate hostname, implementations need to support IDNA2008.
Here are some proposals to consider. Not all of them are necessarily mutually compatible or mutually exclusive.
1. Move all IDNA2008 dependent formats to the format registry. Although we encourage implementation of registry formats, it might be better to deprioritize formats that aren't universally supportable.
2. Reduce the validation requirements for `idn-hostname` and `idn-email`. We could require the ABNF used for IRI hostnames as minimum requirements.
3. Revert `hostname` to its draft-06 definition when it didn't require checking that it's a valid IDNA2008 A-Label.
I think we need to do (3) and either (1) or (2).
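To make proposal (3) concrete, here is a minimal sketch of what a `hostname` check could look like without any IDNA2008 processing. It assumes the letter-digit-hyphen rule as relaxed by RFC 1123 (labels may start with a digit) and a 253 character limit for the dotted form without a trailing dot. The name `isTraditionalHostname` is made up for this illustration and isn't taken from any implementation in this thread.

```javascript
// Assumption: LDH labels per RFC 1123, 1 to 63 characters each,
// no leading or trailing hyphen.
const LDH_LABEL = /^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$/i;

// Hypothetical helper: purely syntactic, no Punycode decoding and no
// Unicode tables involved.
function isTraditionalHostname(hostname) {
  // Assumption: 253 characters is the limit when no trailing dot is present.
  if (hostname.length === 0 || hostname.length > 253) return false;
  // Empty labels (leading, trailing, or doubled dots) fail the label regex.
  return hostname.split(".").every((label) => LDH_LABEL.test(label));
}
```

Under this sketch, something like `xn--mnchen-3ya.de` is accepted simply because it is made of letters, digits, and hyphens. Whether it decodes to a valid IDN is never checked, which is the point of the revert.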
I pretty much leave it up to .NET's built-in `Uri` type, though I haven't done any testing to see if it actually works.
I think 2 and 3 make the most sense to me.
I do support it - you can't quite leave it up to `Uri` in dotnet (there are a few corner cases that don't work without a bit of fiddling).
I've also got a .NET implementation in the pipeline (Utf8Uri) that is API-similar to Uri, and does the full validation properly over the UTF-8 backing data and could be used independently by any .NET apps. It would also be fairly easily ported to other languages, as it has scripts for generating the necessary Unicode tables and the majority of the work is table lookups of various kinds.
(Which would all be a lot easier - though less easily ported - if the .NET libraries didn't make all the low-level Globalization stuff internal, and even still interop out to platform-dependent libraries for some bits rather than use native .NET implementations.)
> I do support it
👏 I'm impressed. I would be surprised if there were many others that do.
> I pretty much leave it up to .NET's built-in Uri type
>
> I've also got a .NET implementation in the pipeline (Utf8Uri) that is API-similar to Uri
I find it interesting that you're both using a URI class to validate hostnames. I noticed that the URI and IRI specs don't require any hostname-specific validation other than the range of characters that are allowed. So, I'm surprised that .NET's Uri class would do any hostname-specific validation. My understanding is that hostname validation is based on DNS rules and capabilities, but URI/IRI decouples from DNS. It doesn't require that the host part is a valid DNS hostname.
> RFC 3986 Section 3.2.2: In other cases, the data within the host component identifies a registered name that has nothing to do with an Internet host. We use the name "host" for the ABNF rule because that is its most common purpose, not its only purpose.

> I think 2 and 3 make the most sense to me.
I just want to make sure we're on the same page about what (2) would mean. We've been moving pretty heavily in the direction of consistent behavior for formats. That's the reason we require "full" validation and no longer allow for partial validation. If we set a minimum validation requirement, then implementations could have inconsistent behavior. Would it be better to define consistent requirements even if it doesn't fully validate the format it defines?
I'm inclined toward a minimum validation requirement in this case with a requirement that the level of validation be documented. I think the edge cases are not terribly likely to be hit by many people.
That leaves the question of what the minimum requirement should be. Here are the levels as I see them.
1. Passes the `ihost` production of RFC 3987 (IRI)
2. Passes length and `-` placement limitations
3. Passes punycode decode/encode roundtrip
4. Passes basic `UTS #46`
5. Passes `UTS #46` with bidirectional rules validation
6. Passes `UTS #46` with joiner rules validation
7. Passes IDNA2008
I think that covers it. I think requiring up to (3) is reasonable. (4) is where large tables start to be needed. I think everyone other than JavaScript implementations should be doing at least up to (4). According to ChatGPT, that's what's implemented in most places, but I can't verify that claim and know it's false at least for Node.js. Beyond that, it's really getting into the edge cases.
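For a sense of what level (3) involves, here is a rough, table-free sketch of a Punycode decode/encode round trip following the algorithm and parameters in RFC 3492. The function names (`decode`, `encode`, `roundTripsPunycode`) are made up for this illustration. It assumes a single label that is already lower case, leaves out the overflow checks from the RFC, and says nothing about levels (4) and up.

```javascript
// Punycode parameters from RFC 3492, section 5.
const BASE = 36, T_MIN = 1, T_MAX = 26, SKEW = 38, DAMP = 700;
const INITIAL_BIAS = 72, INITIAL_N = 128;

// Bias adaptation (RFC 3492, section 6.1).
function adapt(delta, numPoints, firstTime) {
  delta = firstTime ? Math.floor(delta / DAMP) : Math.floor(delta / 2);
  delta += Math.floor(delta / numPoints);
  let k = 0;
  while (delta > ((BASE - T_MIN) * T_MAX) / 2) {
    delta = Math.floor(delta / (BASE - T_MIN));
    k += BASE;
  }
  return k + Math.floor(((BASE - T_MIN + 1) * delta) / (delta + SKEW));
}

// Threshold for digit position k: k - bias clamped to [T_MIN, T_MAX].
const threshold = (k, bias) => Math.min(Math.max(k - bias, T_MIN), T_MAX);

// a-z or A-Z => 0..25, 0-9 => 26..35; anything else yields BASE (invalid).
function digitValue(c) {
  if (c >= 97 && c <= 122) return c - 97;
  if (c >= 65 && c <= 90) return c - 65;
  if (c >= 48 && c <= 57) return c - 22;
  return BASE;
}
const digitChar = (d) => String.fromCharCode(d < 26 ? 97 + d : 22 + d);

function decode(input) {
  const lastDash = input.lastIndexOf("-");
  const output = [];
  // Everything before the last delimiter is copied as basic code points.
  for (let j = 0; j < lastDash; j++) {
    const c = input.charCodeAt(j);
    if (c >= 0x80) throw new RangeError("non-basic code point");
    output.push(c);
  }
  let n = INITIAL_N, i = 0, bias = INITIAL_BIAS;
  let pos = lastDash > 0 ? lastDash + 1 : 0;
  while (pos < input.length) {
    const oldI = i;
    // Read one variable-length integer.
    for (let w = 1, k = BASE; ; k += BASE) {
      if (pos >= input.length) throw new RangeError("truncated input");
      const digit = digitValue(input.charCodeAt(pos++));
      if (digit >= BASE) throw new RangeError("invalid digit");
      i += digit * w;
      const t = threshold(k, bias);
      if (digit < t) break;
      w *= BASE - t;
    }
    const len = output.length + 1;
    bias = adapt(i - oldI, len, oldI === 0);
    n += Math.floor(i / len);
    if (n > 0x10ffff) throw new RangeError("code point out of range");
    i %= len;
    output.splice(i++, 0, n); // insert code point n at position i
  }
  return String.fromCodePoint(...output);
}

function encode(input) {
  const cps = Array.from(input, (ch) => ch.codePointAt(0));
  const basic = cps.filter((c) => c < 0x80);
  let output = String.fromCodePoint(...basic);
  let handled = basic.length;
  if (handled > 0) output += "-";
  let n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
  while (handled < cps.length) {
    // Next code point to handle: the smallest one that is >= n.
    const m = Math.min(...cps.filter((c) => c >= n));
    delta += (m - n) * (handled + 1);
    n = m;
    for (const c of cps) {
      if (c < n) delta++;
      if (c !== n) continue;
      let q = delta;
      for (let k = BASE; ; k += BASE) {
        const t = threshold(k, bias);
        if (q < t) break;
        output += digitChar(t + ((q - t) % (BASE - t)));
        q = Math.floor((q - t) / (BASE - t));
      }
      output += digitChar(q);
      bias = adapt(delta, handled + 1, handled === basic.length);
      delta = 0;
      handled++;
    }
    delta++;
    n++;
  }
  return output;
}

// Hypothetical level (3) check for one label: the part after "xn--" must
// decode, and re-encoding the result must reproduce it exactly.
function roundTripsPunycode(label) {
  if (!label.startsWith("xn--")) return false;
  const encoded = label.slice(4);
  try {
    return encode(decode(encoded)) === encoded;
  } catch {
    return false;
  }
}
```

With this sketch, `roundTripsPunycode("xn--mnchen-3ya")` is `true` and `decode("bcher-kva")` gives `"bücher"`. Because the comparison is exact, a label with upper case Punycode digits would be reported as not round-tripping. That is a choice made for this sketch, not something the thread settles.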
Ah - I'm being a bit stupid. I don't use URI for idn-hostname; I do decoding and then have custom validation code that validates all of the many and varied character class orderings that are or are not permitted.
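```csharp
// `value` is the candidate hostname. Validate.IdnMapping and
// Validate.InvalidIdnHostNamePattern are defined elsewhere (not shown here).
bool isMatch;
if (value.StartsWith("xn--"))
{
    string decodedValue;
    try
    {
        // Punycode form: decode to Unicode, then check the decoded text.
        decodedValue = Validate.IdnMapping.GetUnicode(value);
        isMatch = !Validate.InvalidIdnHostNamePattern.IsMatch(decodedValue);
    }
    catch (ArgumentException)
    {
        // Decoding failed, so the value is not a valid IDN hostname.
        isMatch = false;
    }
}
else
{
    try
    {
        // Unicode form: the result of GetAscii is discarded; the call is
        // only used to find out whether the value can be encoded at all.
        Validate.IdnMapping.GetAscii(value);
        isMatch = !Validate.InvalidIdnHostNamePattern.IsMatch(value);
    }
    catch (ArgumentException)
    {
        isMatch = false;
    }
}
```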
This, I believe, supports everything up to 6 in your list above.
And for VNext I have a version that avoids allocating strings, but is otherwise similar.
We got two votes in the Slack thread for moving all IDNA2008 dependent formats to the registry and continuing to require full validation.
@jdesrosiers
> I've discovered that there are no JavaScript implementations.

Actually there are a few, most notably tr46.
> Normally, I wouldn't shy away from implementing it myself, but I've learned enough to know that any implementation would be prohibitively large (many times the size of my whole JSON Schema implementation).

Yes, they are quite heavy: the one above has 78 installed dependencies, even though the compacted Unicode data source plus the generated ones are only about 150KB. Anyway, it's fast and reliable, but I didn't like it. It's kind of ajv-style: good for internal use, but not quite suitable for interfaces, since it doesn't provide much information about errors, mostly just true or false.
For that reason I created my own idn-hostname and idn-email validators. While building them I learned that the specs are quite precise for idn-hostname, but quite permissive for the local part of idn-email (the domain part is handled entirely by idn-hostname). I'm quite puzzled by the idn-email specs, to be honest.
Hi @SorinGFS 👋
> Actually there are a few, most notably tr46.
I found tr46, but it's not an implementation of RFC 589[0-3] (IDNA2008), it's an implementation of UTS #46. The two specs are not entirely compatible. It was by far the best implementation I found of UTS #46, but that's not the spec 2020-12 requires and they're not 100% compatible. I didn't find any implementation of IDNA2008, it was all UTS #46. Also, I know 150KB doesn't sound like a lot, but that's three times the size of my library without it. I'm not willing to quadruple the size of my library to support a format that few people are ever going to use.
> For that reason I created my own idn-hostname and idn-email validators.

Thanks for sharing! I had a quick look. This also looks like UTS #46, not full IDNA2008. As an example, tr46 doesn't implement all of the context rules registry. I could never figure out for sure if that was something incomplete in tr46 or if it wasn't required in UTS #46, but it is something required in IDNA2008 ~that it didn't look like you implemented either~ (Edit: never mind, I see your README says you did implement it. I'll take a closer look at things later!).
Hi @jdesrosiers
> I'm not willing to quadruple the size of my library to support a format that few people are ever going to use.

I totally agree with that. That's why I said a long time ago (when you were talking about making format mandatory) that there are formats not suitable for the task, because they cannot be enforced using just the basic capabilities of a programming language or framework. Anyway, my specs are quite different regarding formats: I removed the format keyword completely, and validations like idn-hostname are performed through extensions.
> As an example, tr46 doesn't implement all of the context rules registry. I could never figure out for sure if that was something incomplete in tr46...

Yes, it's weird, and in addition to that, things that should be default are achieved through additional options. Without options I guess it remains a pure UTS #46 mapping script. Also, I think they didn't implement the CONTEXTJ rules correctly (I'm guessing that just from reading the code; I didn't perform a test comparison to see the differences). Another thing that I think is wrong in their approach is that they release new versions in parallel with Unicode, while the currently supported version in JavaScript is 15.1.0. Supporting Unicode 16.0.0 (released September 2024) would require engines to update their internal Unicode tables (character properties, normalization, regex Unicode properties, etc.), which will probably be available by the end of this year. For comparison, tr46 has already released version 17.0.0.
In my validator I implemented everything that is actually needed except backward compatibility (which I don't think anyone uses today; those rules are really ancient), so please take a look, and if you find anything that you (or anyone here) think I should implement, I would be happy to update it.
I've come to a better understanding of how this works from reading RFC 5895. IDNA2008 (RFC 589[0-3]) defines a framework for international domain names, but doesn't specify a complete implementation. One of the steps in the framework is to normalize the string. That step is called "mapping". It does things like map upper case letters to lower case letters. However, that mapping depends on the context.
> A simple and well-known example is the lowercasing of the letter LATIN CAPITAL LETTER I (U+0049) when it is used in the Turkish and other languages. A capital "I" in Turkish is properly lowercased to a LATIN SMALL LETTER DOTLESS I (U+0131), not to a LATIN SMALL LETTER I (U+0069). This lowercasing is clearly dependent on the locale of the system and/or the locale of the user. Using a single context-free mapping without considering the user interface properties has the potential of doing exactly the wrong thing for the user.

So, they don't define a mapping and expect others to define context-specific mappings. That may happen in some applications, but it seems like the only mapping that was ever defined is UTS #46, which isn't locale-aware; it's just a general-purpose mapping. It also seems to be what everyone implements.
The point is, because the spec says that these formats use IDNA2008, the mapping isn't defined. That means that different implementations could give different results if they use different mappings. In order to provide the consistency we want to have with formats, we need to specify a mapping and that means UTS #46.
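As a small illustration of the kind of context-free mapping being discussed, the sketch below leans on the platform's own `URL` parser, which converts hostnames to their ASCII form when it parses an `https:` URL. The helper name `mapHostname` and the up-front delimiter check are assumptions made for this example. It only shows the mapping step in action and doesn't establish which of the validation levels discussed earlier a given runtime actually enforces.

```javascript
// Characters that would change how the URL parser splits the input.
// Rejecting them up front is a simplification made for this sketch.
const URL_DELIMITERS = /[\s\/\\?#@:\[\]%]/;

// Hypothetical helper: returns the parser's ASCII form of `host`,
// or null when the input is rejected.
function mapHostname(host) {
  if (host === "" || URL_DELIMITERS.test(host)) return null;
  try {
    return new URL(`https://${host}/`).hostname;
  } catch {
    return null; // the URL parser refused the host
  }
}
```

For example, `mapHostname("MÜNCHEN.de")` gives `"xn--mnchen-3ya.de"`: the upper case letters are mapped to lower case before the Punycode step, with no locale involved. One caveat is that inputs that look like IPv4 addresses are parsed as addresses rather than as domain names, so this isn't a hostname validator on its own.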
So, if there's no objection, this is what I'd like to move forward with.
- Revert `hostname` to its draft-06 definition that doesn't require validating IDNs.
- Require full validation of `idn-hostname` and `idn-email` as IDNA2008 using the `UTS #46` mapping.
- Move `idn-hostname` and `idn-email` to the format registry.
> Passes punycode decode/encode roundtrip

After updating my library, I've found that it has a really hard time with punycode, specifically. I've opted not to support these for now.
In order to move forward with @jdesrosiers' suggestion, we'll need to put the format registry in place first.