Provide UTS 46 mapping / disallow operations fused with normalization
Gecko implements most of its IDNA processing on its own, but it currently uses ICU4C's uidna_labelToUnicode for UTS 46 processing.
Code that wraps ComposingNormalizer::try_new_uts46_without_ignored_and_disallowed_unstable and augments it to provide the functionality of ICU4C's uidna_labelToUnicode needs to go somewhere. Since it needs ignored and disallowed data that is in sync with ICU4X data, it would make sense to start a new ICU4X component for it. (We have already reserved the name.)
Should we even provide transitional processing, since it looks like the transition period is finally ending?
What timing with the Mustafa email!
As discussed in https://github.com/unicode-org/icu4x/issues/42, I prefer for IDNA to be its own crate that can be powered with ICU4X data. I think the unicode_bidi solution is an appropriate model to use here.
This is a concern largely driven by governance: ICU4X is already a large project, and I don't wish to take ownership of components for which we don't have sufficient knowledge to maintain. We accepted Collator and Normalizer because they are integral to ECMA-402; this required a fairly substantial effort between yourself and @echeran as co-owner, and future changes will require one of you as a reviewer.
After exploring the Rust crate space around this topic a tiny bit, I'm not particularly happy to find how much code is needed to match uidna_labelToUnicode given the existence of ComposingNormalizer::try_new_uts46_without_ignored_and_disallowed_unstable. Still, ComposingNormalizer::try_new_uts46_without_ignored_and_disallowed_unstable is a very odd primitive to provide without providing the data for 'disallowed' and 'ignored'.
How about this: We put a data struct carrying the disallowed & ignored data in a crate called icu_idna and have that crate provide an iterator adapter that consumes an iterator over char, exposes an iterator over Result<char, IdnaError>, and, between input and output, performs the combination of the "Map" and "Normalize" steps of the "Processing" section of UTS 46?
This would put all the relevant data into ICU4X but would leave Punycode and the rest out of ICU4X for now. (ComposingNormalizer::try_new_uts46_without_ignored_and_disallowed_unstable is designed for this and it would be really weird to have to get the disallowed & ignored data from outside ICU4X.)
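To make the shape concrete, here is a toy model of what such an adapter could look like. Everything in it is illustrative: the names map_normalize and IdnaError are hypothetical, and the hard-coded character classes (SOFT HYPHEN for "ignored", C0 controls for "disallowed", ASCII lowercasing for "mapped") stand in for real ICU4X-backed data.

```rust
/// Illustrative error type; a real icu_idna crate would define its own.
#[derive(Debug)]
struct IdnaError(char);

/// Toy stand-in for the proposed adapter: consumes an iterator over char and
/// yields Result<char, IdnaError>, dropping "ignored" characters, surfacing
/// "disallowed" ones as errors, and "mapping" the rest. The real data and
/// normalization would come from ICU4X.
fn map_normalize(
    input: impl Iterator<Item = char>,
) -> impl Iterator<Item = Result<char, IdnaError>> {
    input.filter_map(|c| match c {
        '\u{00AD}' => None,                                 // ignored: removed
        '\u{0000}'..='\u{001F}' => Some(Err(IdnaError(c))), // disallowed: error
        _ => Some(Ok(c.to_ascii_lowercase())),              // mapped
    })
}

fn main() {
    // Collecting into Result<String, _> stops at the first disallowed char.
    let out: Result<String, IdnaError> = map_normalize("Ex\u{00AD}ample".chars()).collect();
    assert_eq!(out.unwrap(), "example");
}
```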
@markusicu @echeran can you provide input here? What sensible primitive can ICU4X expose that enables IDNA to be implemented externally?
Note: My assumption is that UTS 46 section 4.1 Validity Criteria items 1 and 6 would be checked by applying the Map & Normalize primitive proposed above and checking the result for errors or for equality with the input. This means that in the xn-- case of Processing step 4, the Map & Normalize primitive would run again, and in the non-xn-- case items 1 and 6 would already hold by construction.
Please let me know if I've missed something and this wouldn't hold.
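For concreteness, the idempotence check could look roughly like this; it reuses the hypothetical map_normalize toy from the earlier sketch, so it is still purely illustrative.

```rust
/// Sketch: a label satisfies Validity Criteria items 1 (NFC) and 6 (valid
/// mapping) iff running the Map & Normalize primitive over it produces no
/// errors and leaves it unchanged. Reuses the map_normalize toy from above.
fn is_valid_per_items_1_and_6(label: &str) -> bool {
    let mut out = String::new();
    for result in map_normalize(label.chars()) {
        match result {
            Ok(c) => out.push(c),
            Err(_) => return false, // disallowed character: item 6 fails
        }
    }
    out == label // unchanged by Map & Normalize: items 1 and 6 hold
}
```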
Oh, and the data needs to support flagging disallowed-if-UseSTD3ASCIIRules=true (and ComposingNormalizer::try_new_uts46_without_ignored_and_disallowed_unstable needs to reflect UseSTD3ASCIIRules=false).
We need a trie with two bits per scalar value. We could pack the bits for four scalar values into one u8 trie value: shift the code point right by two before the lookup, then use the code point's two least-significant bits, times two, to decide how far to shift the trie value right before taking its two least-significant bits as a four-item enum (see the sketch after the list):
- pass through (default)
- ignore
- disallowed
- disallowed if STD3
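Here is a sketch of the packing arithmetic only; trie_get stands in for an actual trie lookup, and none of this is an existing ICU4X API.

```rust
/// The four-item enum encoded in two bits per scalar value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MappingClass {
    PassThrough,      // 0 (default)
    Ignore,           // 1
    Disallowed,       // 2
    DisallowedIfStd3, // 3
}

/// Packed lookup sketch: `trie_get(i)` stands in for a trie returning the u8
/// value that packs the classes of scalar values 4*i .. 4*i+3.
fn classify(c: char, trie_get: impl Fn(u32) -> u8) -> MappingClass {
    let cp = c as u32;
    // Four scalar values share one trie value: drop the two low bits for the lookup...
    let packed = trie_get(cp >> 2);
    // ...and use them (times two) to select the bit pair within the u8.
    let shift = (cp & 0b11) * 2;
    match (packed >> shift) & 0b11 {
        0 => MappingClass::PassThrough,
        1 => MappingClass::Ignore,
        2 => MappingClass::Disallowed,
        _ => MappingClass::DisallowedIfStd3,
    }
}
```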
Ping @markusicu @echeran
Current notes:
- The data contemplated earlier should probably go into icu_normalizer so that icu_normalizer can expose a coherent operation to an IDNA crate instead of exposing a weird normalization that requires an extra preprocessing step to make sense.
- It turns out that in the WHATWG URL Standard, STD3 enforcement is only for checking conformance of content but is irrelevant to non-conformance-checker URL consumers. (In particular, Gecko doesn't appear to ever enable STD3 enforcement.)
- How much will it cost to carry around the STD3 data when it's not actually used?

From merely looking at the data with knowledge of the relevant data structure, making everyone carry the STD3-disallowed info does not seem particularly nice, but I haven't actually measured yet.
Now that I've looked at this some more, it seems to me that it's a bad idea to check the input scalar values for being STD3-disallowed, and that it instead makes sense to check the output for STD3-prohibited ASCII when the STD3 check is in effect. Am I missing something?
About UseSTD3ASCIIRules:
- Check for STD3-disallowed after mapping.
- You don't need two versions of the data. Just use the non-STD3 data and then check for STD3-disallowed ASCII characters.
- In an ASCII-only fastpath, you can fuse the mapping & checking, including for STD3-disallowed ASCII if UseSTD3ASCIIRules=true.
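A minimal sketch of that post-mapping check, assuming the mapped output is already lowercase (as the UTS 46 mapping makes it), so the only permitted ASCII characters are a-z, 0-9, and hyphen-minus:

```rust
/// Sketch: after mapping with the non-STD3 data, UseSTD3ASCIIRules=true is
/// enforced by rejecting any remaining ASCII outside [a-z0-9-]. Non-ASCII
/// bytes in UTF-8 are all >= 0x80, so scanning bytes covers exactly the
/// ASCII subset.
fn has_std3_disallowed_ascii(mapped_label: &str) -> bool {
    mapped_label.bytes().any(|b| {
        b.is_ascii() && !(b.is_ascii_lowercase() || b.is_ascii_digit() || b == b'-')
    })
}
```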