icu4x
icu4x copied to clipboard
Move CanonicalCombiningClass type out of icu_properties
The data for CanonicalCombiningClass is available both via icu_properties and via icu_normalizer. To have the same type in both cases, icu_normalizer depends on icu_properties. This prevents the two from being compiled in parallel. See https://github.com/servo/rust-url/issues/939
We should break the dependency. The obvious thing to do is to move the type to icu_normalizer and not make the data available via icu_properties at all. If we really wanted to have two ways of providing the data, the type should go into a third crate, but there isn't a candidate destination that wouldn't be weird.
Hmm, optimizing for crates compiling in parallel isn't something I think we've looked at closely.
...and this is harder than it first appears, since we support named access of property data for the benefit of regular expression engines, and canonical combining class is one of the properties for which access by name is supported.
I think it's bad to have the data in two places, but having the data only normalizer and having named access in properties would break build parallelism due to an opposite dependency. This implies putting named access in a separate crate.
The current build parallelism issue could be addressed by moving the type to a third crate while keeping two instances of the data. Kinda weird to have a one-off crate just for CanonicalCombiningClass. Potentially excessive to have a separate crate for all the property types.
(Not sure if the meeting minutes on this issue ended up on GitHub somewhere.)
It occurred to me that the approach discussed in the meeting of making the default features of icu_normalizer depend on icu_properties but making it possible for idna and icu_harfbuzz to opt out the default features to use a u8-returning API isn't that great as a built parallelism solution: The moment the application includes a dependency on icu_normalizer without opting out of default features, icu_normalizer would stop building in parallel with icu_properties even if the added dependency on icu_normalizer didn't care about the canonical combining class API at all (after all, the motivation for that API existing is icu_harfbuzz and the other currently known user is idna 1.0).
To not make it too easy to break build parallelism, it seems like a typed getter for canonical combining class should be a non-default feature (which would go against the current ICU4X design approach of not having to enable features for getting API surface).
June 27
- @hsivonen - We have these small struct wrappers in icu_properties. If we moved those to separate smaller crate we could parallelize builds better.
-
- Is it only the struct?
- @hsivonen - The struct and the functions on it.
-
- Put it in
icu_provider, with doc hidden? We do this for some locale fallback stuff.
- Put it in
-
- I don't think we need to go that far. The reason is CanonicalCombiningClassMap; it's not even used in the normalizer, an extra thing for harfbuzz.
- @hsivonen - Yes, and idna. The reason the CCC data is also in the normalizer is that it lives in the same trie as the normalization stuff.
-
- But the CCCMap is not used by idna
- @hsivonen - I think it is? Check the latest version. If you have a ZWJ, that's allowed if the CCC of the previous character is a virama.
-
- I don't see it.
-
- The only two APIs where this matters are the actual APIs that get you values of type CCC. We could, like, we have a
get_u8function and it gives you a u8 and then we have feature-flaggedget_CCCthat return the fancier types. Theicu_propertiesCCC has name mapper, named constants, etc.
- The only two APIs where this matters are the actual APIs that get you values of type CCC. We could, like, we have a
- @sffc - We could have CCC that lives in both crates, with conversion between them. But I like the
get_u8approach. I don't think we should polluteicu_provider. - @hsivonen - I think what is a nicer user experience
-
- Tradeoff is all the methods in CCC would no longer be methods in CCC. It has name getter methods. That's half the benefit of a standalone type.
- @sffc - Open to the idea of an icu core crate (doesn't do anything). Quite a number of things to shove in there, including CCC. We've so far avoided it, it's probably best to continue to avoid it. I don't think there's any harm in a
get_u8function - @hsivonen - Do they use the constants?
-
- I think collator but not normalizer.
- We can have a test that asserts constants are the same.
- @sffc - We want to keep the problem internal to our crates
-
- I think there's a valid reason to split the properties crate in two features. This ties into the next topic. We can only reduce
syndependencies from certain things, like map and set getters. Alsozerovec. This is where it starts getting tricky. If we want to makeicu_propertiesfree of zerovec-derive, do we need to add more gnarly things? That would be feature flagging, which could include maps and sets by default (except maybe compile times of baked data?), and then other things can be added in. It's potentially worth investigating. Just pointing out this has valid use cases outside of CCC.
- I think there's a valid reason to split the properties crate in two features. This ties into the next topic. We can only reduce
Consensus:
- If desired, we can add
get_u8andget32_u8to the CCCMap inicu_normalizer, allowing us to feature-gateicu_properties. Other changes, such as reducing constants, are internal. - It would be a default feature, so it can be done post 2.0.
- Can document choices on the APIs
LGTM: @sffc @younies (@hsivonen)
I'm fine making it be a non-default feature.
Me too
Though in that case we need to make the change now
@robertbastian Do you agree with the above conclusion?
Note: When doing this, we should test the effect on the special cases in the Pernosco debugger. If this breaks them, we should make the feature gate for icu_properties also restore the internal types to match what Pernosco expects so that to debug, you'd opt into to icu_properties without having to ask for Pernosco to be adjusted (again) for ICU4X debugging.
sgtm