
Reconsider UTF-32 support

Open dpk opened this issue 4 years ago • 33 comments

string_representation.md:

The use of UTF-32 is rare enough that it's not worth supporting.

There is one significant use of UTF-32 in the real world: Python’s so-called ‘flexible string representation’. See PEP 393. The short version: Python internally stores strings as Latin-1 if they only contain characters ≤ U+00FF; as UTF-16 (guaranteed valid, fwiw) if they contain only characters in the BMP; or otherwise as UTF-32. This is intended to provide the most efficient representation for a majority of strings while retaining O(1) string indexing — it’s much like what the document says about what SpiderMonkey and V8 do, but since Python string indexing, unlike JS string indexing, returns real codepoints and not UTF-16 code units, it adds an extra upgrade to UTF-32.
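
For concreteness, a rough model of the storage-kind selection (an illustrative Rust sketch, not CPython's actual implementation): the widest code point in the string picks the representation, and each kind keeps one code point per storage unit so indexing stays O(1).

    /// Illustrative model only (not CPython code): PEP 393 picks the
    /// narrowest storage that can hold every code point in the string,
    /// keeping one code point per storage unit so indexing stays O(1).
    enum FlexibleString {
        Latin1(Vec<u8>),  // all code points <= U+00FF
        Ucs2(Vec<u16>),   // all code points <= U+FFFF (BMP, no surrogate pairs needed)
        Ucs4(Vec<u32>),   // at least one supplementary-plane code point
    }

    fn flexible(s: &str) -> FlexibleString {
        let widest = s.chars().map(|c| c as u32).max().unwrap_or(0);
        if widest <= 0xFF {
            FlexibleString::Latin1(s.chars().map(|c| c as u8).collect())
        } else if widest <= 0xFFFF {
            FlexibleString::Ucs2(s.chars().map(|c| c as u16).collect())
        } else {
            FlexibleString::Ucs4(s.chars().map(|c| c as u32).collect())
        }
    }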

In the Scheme world, R7RS Large can reasonably be expected to require that codepoint indexing into strings (or some variant of strings — it’s possible we’ll end up with a string/text split like Haskell’s) be O(1), so I expect UTF-32 or Python-style flexible string representation to become common in that context, too.

(Also, before flexible string representation was introduced into Python, UTF-32 was used for all strings.)

dpk avatar Mar 14 '21 17:03 dpk

CC @hsivonen

sffc avatar Mar 14 '21 21:03 sffc

If there is interest in interfacing with Python on that level instead of going via UTF-8, I guess that's a use case, then.

Note: Python doesn't guarantee UTF-32 validity: The 32-bit-code-unit strings can contain lone surrogates, so if the use case is interfacing with Python without UTF-8 conversion, ICU4X would need to check each code unit for being in the range for Rust char instead of assuming validity.
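
Concretely, that per-code-unit check could look something like this (a sketch, not ICU4X's actual internal code):

    /// Decode potentially-ill-formed UTF-32 (e.g. CPython string data that
    /// may contain lone surrogates) into Rust scalar values, mapping any
    /// invalid code unit to U+FFFD.
    fn chars_from_utf32(code_units: &[u32]) -> impl Iterator<Item = char> + '_ {
        code_units
            .iter()
            .map(|&u| char::from_u32(u).unwrap_or(char::REPLACEMENT_CHARACTER))
    }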

hsivonen avatar Mar 15 '21 15:03 hsivonen

I would hypothetically be interested in writing a Python binding once icu4x has a C API, as an alternative to the very un-Pythonic and under-documented PyICU. (I’m the author of the PyICU cheat sheet, which is afaik the only API documentation specific to ICU in Python — otherwise, you’re just referred to the C++ API and left to work out how it maps on to Python for yourself.)

dpk avatar Mar 15 '21 15:03 dpk

Author of PyICU here: if you find PyICU very un-pythonic, please provide concrete examples of how you're doing something with PyICU and how you'd suggest it be done instead in order to be more pythonic. I'm happy to either fix actual un-pythonic examples of PyICU use cases or show you how it's done. Please be very specific; there is already a lot of built-in Python iterator support, for example, that you may just not know about. It's OK to ask and suggest improvements (!)

I'm pretty sure that PyICU being very under-documented also contributes to your perceiving it as un-pythonic. I declared PyICU documentation bankruptcy over a decade ago, as the ICU API surface is huge and keeps growing. I cannot provide another set of docs; it's hard enough to keep up with ICU proper, and the ICU docs themselves are pretty good. This is open source, and the PyICU C++ wrappers around C/C++ ICU are fairly regular, so I encourage you to read the code to see what is possible, what is supported, and how to use it.

ovalhub avatar Mar 15 '21 18:03 ovalhub

For example, from your cheat sheet, you seem to not know that BreakIterator is a python iterator:

   from icu import *
   de_words = BreakIterator.createWordInstance(Locale('de_DE'))
   de_words.setText('Bist du in der U-Bahn geboren?')
   list(de_words)
   [4, 5, 7, 8, 10, 11, 14, 15, 16, 17, 21, 22, 29, 30]

Yes, I understand, you'd prefer the actual words to be returned, but that's not how ICU designed the BreakIterator; they chose to give you boundaries instead. It's not that hard, in Python, to then combine those boundaries into words. That being said, this might become a lot of work to do consistently, so adding a higher-level iterator that gives you the words would be nice too.
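
(For what it's worth, the boundary-pairing step is tiny in any language; a sketch in Rust, assuming a boundary list that starts at 0, ends at the text length, and uses the same indices the text is sliced with:)

    /// Turn consecutive break boundaries into the segments between them.
    /// Assumes `boundaries` starts at 0, ends at `text.len()`, and contains
    /// valid slice indices for `text`.
    fn segments<'a>(text: &'a str, boundaries: &[usize]) -> Vec<&'a str> {
        boundaries
            .windows(2)
            .map(|pair| &text[pair[0]..pair[1]])
            .collect()
    }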

ovalhub avatar Mar 15 '21 19:03 ovalhub

We will revisit string encodings as we approach ICU4X v1.

sffc avatar Mar 18 '21 18:03 sffc

Discussion:

  • @Manishearth - ICU4X is powered by Diplomat, which allows the target to set the encoding for output strings (via Writeable). So the question is mainly scoped to input strings and text processing APIs like Segmenter.
  • @robertbastian - Should we consider returning iterator types in Segmenter that abstract over the string encoding?
  • @sffc - Segmenter and Collator have fine-tuned code paths for UTF-8 and UTF-16, so it's not necessarily trivial to add UTF-32 support. We could do it if it is well-motivated.
  • @robertbastian - Who else needs UTF-32? Are there numbers on the use?
  • @Manishearth - Ruby supports bring-your-own-encoding. Most clients use either UTF-8 or UTF-16. A rope-based string representation was useful in Xi, which allowed for incremental segmentation (re-segmenting text after an edit without re-computing the whole string).
  • @sffc - UTF-32 is useful as a code point storage mechanism, like &[char]. I haven't seen it used widely as an encoding for strings.
  • @robertbastian - Based on @hsivonen's comment above, Python does not guarantee validity of UTF-32, so maybe this should just be handled in the Python FFI layer.
  • @Manishearth - Since it is Python, maybe we just do an in-place conversion.
  • @robertbastian - We should punt the decision until we add Python FFI support.
  • @sffc - Agreed.

sffc avatar May 20 '22 17:05 sffc

This should not be taken as an endorsement of UTF-32, but as a matter of how hard things would be for the collator specifically:

Segmenter and Collator have fine-tuned code paths for UTF-8 and UTF-16, so it's not necessarily trivial to add UTF-32 support.

The collator and the decomposing normalizer consume an iterator over char internally (with errors mapped to U+FFFD), so adding UTF-32 support would be trivial. At compile time, there would be separate codegen instances for UTF-32, which would grow the binary size, but those instances should also be eligible to be thrown away by LTO when not used (except for FFI, where there's currently a Rust symbol visibility issue standing in the way of cross-language LTO doing proper dead code analysis).

hsivonen avatar May 23 '22 07:05 hsivonen

Since most strings don't contain supplementary-plane characters, supporting UTF-32 wouldn't really help: if most Python strings were converted to UTF-32 at the ICU4X API boundary, they might as well be converted to UTF-8, unless indices are returned.

Indices are relevant to the segmenter. In that case, it might actually help Python to convert to UTF-32 and then segment that.

Other than that, the case where avoiding conversion to UTF-8 might make sense is the collator, which performs a lot of string reading without modification. However, to have the collator operate without having to create (converted) copies of the Python string data, there'd need to be 6 methods:

  1. Compare potentially-ill-formed UTF-32 and potentially-ill-formed UTF-32.
  2. Compare UCS-2 and UCS-2.
  3. Compare Latin 1 and Latin 1.
  4. Compare potentially-ill-formed UTF-32 and UCS-2.
  5. Compare potentially-ill-formed UTF-32 and Latin 1.
  6. Compare UCS-2 and Latin 1.

The remaining three of the nine cases are mirrors of the last three, so there's no point in generating code for those separately.
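
The mirroring works because swapping the operands only requires reversing the resulting Ordering; a sketch with hypothetical method names (not existing ICU4X API):

    use core::cmp::Ordering;

    // Hypothetical signature standing in for one of the six hand-written cases.
    fn compare_utf32_latin1(_left: &[u32], _right: &[u8]) -> Ordering {
        unimplemented!("hand-written UTF-32 vs. Latin-1 comparison")
    }

    /// A mirror case reuses the hand-written one by swapping the operands
    /// and reversing the returned Ordering.
    fn compare_latin1_utf32(left: &[u8], right: &[u32]) -> Ordering {
        compare_utf32_latin1(right, left).reverse()
    }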

Note that a surrogate pair in a Python string has the semantics of a surrogate and another surrogate. The result does not have supplementary-plane semantics. I haven't checked if surrogates promote to 32-bit code units or if the 16-bit-code-unit representation can have surrogates that don't have UTF-16 semantics. That is, it's unclear to me if item 2 can reuse the UTF-16 to UTF-16 comparison specialization.

Note that the raw Python data representation is available via PyO3 "on non-Py_LIMITED_API and little-endian only".

If someone really cares, it would make sense to benchmark the collator with these 6 variants (out-of-repo) vs. converting to UTF-8 and then using the &str-to-&str comparison.

hsivonen avatar Nov 21 '22 17:11 hsivonen

With the infrastructure from #6674, it would be rather easy to introduce a pyo3 feature that added compare_py(&self, left: &PyStringData, right: &PyStringData) using https://pyo3.rs/main/doc/pyo3/types/enum.pystringdata .

It's debatable whether ICU4X would be OK with a feature that makes ICU4X refer to PyStringData, as opposed to a feature that exposes the six methods from the previous comment so that the PyStringData reference could live in an external crate.

hsivonen avatar Jun 16 '25 08:06 hsivonen

Not particularly in favor of an optional pyo3 dep.

Might be possible to do this as one method using generic encoding traits?

Manishearth avatar Jun 16 '25 17:06 Manishearth

Can Python use Latin-1 for UCS1 and UTF-16 for UCS2? We already have precedent for Latin-1 and UTF-16 in ICU4X Segmenter. Then we could just add UTF-32.

sffc avatar Jun 16 '25 21:06 sffc

@sffc I don't understand, are you proposing we ask Python to change its internal string representation?

Manishearth avatar Jun 16 '25 22:06 Manishearth

No, I'm asking whether Python UCS1 can be fed into ICU4X Latin-1 APIs and whether Python UCS2 can be fed into ICU4X UTF-16 APIs, in which case we don't need to plug a pyo3 dep into the ICU4X core library.

sffc avatar Jun 16 '25 22:06 sffc

@sffc Ah.

I think that should be fine, since Python will only choose an encoding if the string can be represented in it, and "valid UCS2" is a subset of UTF-16. There is no UCS-1.

That doesn't really solve the problem, though; it only gets rid of one comparison endpoint (number 2 above: UCS-2 vs. UCS-2). We still need 4 new comparison functions on top of the 3 we already have.

I think writing a pluggable compare function that supports having different left and right encodings is the way to go here. We might lose out on some of the optimizations but that's fine, that's what the specialized ones are for.

Manishearth avatar Jun 16 '25 22:06 Manishearth

Oh I see, I missed the part about mixing encodings.

Yeah, if there's an easy way to do it, traits seem to make sense here.

sffc avatar Jun 17 '25 00:06 sffc

Not particularly in favor of an optional pyo3 dep.

Might be possible to do this as one method using generic encoding traits?

The simplest thing that could work, in terms of the size of the public API, would be a cpython feature on icu_collator (looking at PyO3, PyPy and GraalPy represent strings differently and should go through UTF-8). It wouldn't bring new deps but would have an enum with the same shape as PyO3's PyStringData, and then an external crate would have to map from one enum to the other and hope that it optimizes away.
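
As a sketch, an enum of that shape might look like this (hypothetical names, not committed API):

    /// Hypothetical mirror of pyo3::types::PyStringData, so that an external
    /// crate could map one enum to the other without icu_collator depending
    /// on pyo3 itself.
    pub enum CPythonStringData<'a> {
        /// PEP 393 "UCS1" storage: Latin-1, one byte per code point.
        Ucs1(&'a [u8]),
        /// PEP 393 "UCS2" storage: BMP code points, one u16 each.
        Ucs2(&'a [u16]),
        /// PEP 393 "UCS4" storage: one u32 per code point; may contain lone
        /// surrogates, i.e. potentially ill-formed UTF-32.
        Ucs4(&'a [u32]),
    }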

Or we could expose a larger number of methods (behind a Cargo feature) and let the external crate deal with deciding which ones to call:

Can Python use Latin-1 for UCS1 and UTF-16 for UCS2? We already have precedent for Latin-1 and UTF-16 in ICU4X Segmenter. Then we could just add UTF-32.

PyStringData::Ucs1 is the same as SpiderMonkey/V8 Latin1, so, yes, the Latin1 stuff works as-is for that case.

PyStringData::Ucs2 is a subset of what our _utf16 methods take, so our _utf16 methods would work albeit not quite optimally compared to monomorphizing an exactly sufficient version that doesn't check for surrogate pairs and always just does u16_slice.iter().map(|&code_unit| char::from_u32(u32::from(code_unit)).unwrap_or(REPLACEMENT_CHARACTER)).

So to support CPython, if we don't want that micro optimization, we could add a utf32 feature that adds three methods in addition to the ones from #6674:

  1. compare_utf32(&self, left: &[u32], right: &[u32]) (that does left.iter().map(|&code_unit| char::from_u32(code_unit).unwrap_or(REPLACEMENT_CHARACTER)); CPython does not guarantee valid scalar values, so assuming char without checking the range would not be OK).
  2. compare_utf16_utf32(&self, left: &[u16], right: &[u32])
  3. compare_latin1_utf32(&self, left: &[u8], right: &[u32]) (if both latin1 and utf32 Cargo features are active)

This way, a utf32 Cargo feature wouldn't be strictly tied to either PyO3 or CPython, though the only realistic potential consumer would be a PyO3 wrapper for icu_collator and only in the CPython (not PyPy or GraalPy) mode.

hsivonen avatar Jun 17 '25 06:06 hsivonen

Segmenter and Collator have fine-tuned code paths for UTF-8 and UTF-16, so it's not necessarily trivial to add UTF-32 support.

AFAICT, adding UTF-32 support to the segmenter would be easy, too:

Add _utf32 entry points and internally make the iterator a copy-paste of Latin1Indices with u8 replaced by u32.
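
A guess at the shape (not the actual internal type): an index-tracking iterator over u32 code units, analogous to the Latin-1 one:

    /// Guessed shape only: iterate over potentially-ill-formed UTF-32,
    /// yielding (index, scalar value) pairs with invalid units mapped
    /// to U+FFFD, the way a Latin-1 index iterator does for u8.
    struct Utf32Indices<'a> {
        iter: core::iter::Enumerate<core::slice::Iter<'a, u32>>,
    }

    impl<'a> Utf32Indices<'a> {
        fn new(slice: &'a [u32]) -> Self {
            Self { iter: slice.iter().enumerate() }
        }
    }

    impl<'a> Iterator for Utf32Indices<'a> {
        type Item = (usize, char);

        fn next(&mut self) -> Option<Self::Item> {
            let (i, &u) = self.iter.next()?;
            Some((i, char::from_u32(u).unwrap_or(char::REPLACEMENT_CHARACTER)))
        }
    }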

hsivonen avatar Jun 17 '25 12:06 hsivonen

Supporting the segmenter with PyPy (and GraalPy?) might be more involved, though, if Python semantics require UTF-32 indices to be exposed but the data shown to ICU4X is UTF-8.

hsivonen avatar Jun 17 '25 13:06 hsivonen

So to support CPython, if we don't want that micro optimization, we could add a utf32 feature that adds three methods in addition to the ones from https://github.com/unicode-org/icu4x/pull/6674:

@hsivonen hold on, I think there are still a fair number of assumptions about microoptimizations here: why must the API be specialized this way? I understand there may be performance benefits of specializing the API like that, but we should talk explicitly about them, and potentially benchmark them before deciding on exploding the API that way. Again, a generic compare<Encoding1, Encoding2>() API is certainly possible.

I suspect a compare API that takes in generic encodings is going to lose out on some of the ways we can write code specifically for an encoding, but it should not be significantly slower.

Personally my position is thus: If we decide to fix this problem, we should try to write a generic compare() that operates on encodings. That is useful to expose anyway. We should then benchmark it against specialized functions we care about, and see if the performance is significantly different. Then we should make a decision on adding a single API or multiple.

I'm very wary of the "multiple APIs" route because this is the type of thing that explodes quickly. CPython has a set of encodings it cares about, but so does V8 (we'd need utf8-utf16 mixing). Spidermonkey may as well. Language runtimes make a lot of different choices here, I am very reluctant to go down this path. I see nothing special about CPython here, this is not "just" adding 3 new APIs.

Manishearth avatar Jun 17 '25 16:06 Manishearth

I think we only need 4 encodings in ICU4X itself: Latin-1, potentially ill-formed UTF-8, potentially ill-formed UTF-16 (with unpaired surrogates), and potentially ill-formed UTF-32. Any encodings more strict than this can be fed into the less strict API. So I don't see this as an unbounded explosion.

sffc avatar Jun 17 '25 16:06 sffc

@sffc The APIs in question take in two strings. This may not be unbounded, but it is an explosion, and I don't think it's controversial that the "bound" of sixteen APIs is quite large; I definitely do not wish to go down a path that commits us to that.

Manishearth avatar Jun 17 '25 16:06 Manishearth

It isn't 4^2, it is 4 + 4 choose 2, which is 8.

sffc avatar Jun 17 '25 17:06 sffc

(⁴C₂ is 6, so that number is 10)

That's a bit more okay, but still large.

Manishearth avatar Jun 17 '25 17:06 Manishearth

So to support CPython, if we don't want that micro optimization, we could add a utf32 feature that adds three methods in addition to the ones from #6674:

@hsivonen hold on, I think there are still a fair number of assumptions about microoptimizations here: why must the API be specialized this way?

Regarding the UCS2 micro optimization compared to potentially-ill-formed UTF-16: No must; the optimization would be quite micro.

Specialization generally: avoiding allocations for conversions (particularly when the same string participates in multiple comparisons and naive glue code would convert it every time) and avoiding the conversions themselves.

I understand there may be performance benefits of specializing the API like that, but we should talk explicitly about them, and potentially benchmark them before deciding on exploding the API that way. Again, a generic compare<Encoding1, Encoding2>() API is certainly possible.

I suspect a compare API that takes in generic encodings is going to lose out on some of the ways we can write code specifically for an encoding, but it should not be significantly slower.

The main concern with compare<Encoding1, Encoding2>() is getting an identical prefix skipping function that is appropriate for the pair.

For example, a strictly CPython-oriented Ucs2-to-Ucs4 comparison could use an identical prefix skipping function of a simpler shape, but an actual potentially-ill-formed UTF-16 to (potentially ill-formed or not) UTF-32 comparison would need a subtly more complicated shape.
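
For illustration, the CPython-shaped prefix skip can be a plain unit-by-unit zip, since neither side has surrogate pairs (a sketch, not the collator's actual code):

    /// Sketch of the simpler, CPython-shaped prefix skip: with no surrogate
    /// pairs on either side, the length of the identical prefix is just a
    /// unit-by-unit zip. A genuine potentially-ill-formed UTF-16 vs. UTF-32
    /// version has to be more careful around surrogates.
    fn identical_prefix_len_ucs2_ucs4(left: &[u16], right: &[u32]) -> usize {
        left.iter()
            .zip(right.iter())
            .take_while(|&(&l, &r)| u32::from(l) == r)
            .count()
    }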

My point here isn't the tiny, tiny perf delta between the two but that the more complicated shape should probably be hand-written instead of expecting it to arise automatically from a trait combination.

I'm very wary of the "multiple APIs" route because this is the type of thing that explodes quickly. CPython has a set of encodings it cares about, but so does V8 (we'd need utf8-utf16 mixing).

That kind of UTF-8 use in V8 is news to me. If V8 needs it, we could add that case as well, but that case's identical prefix skipping approach would differ from the ones that are already there or that accommodating CPython would require, so it illustrates that we should hand-tune the identical prefix function for the cases that we want instead of expecting the right identical prefix code to fall out of a trait combination.

Spidermonkey may as well.

What SpiderMonkey needs just landed for the collator (behind a Cargo feature), and we've long ago accepted to accommodate SpiderMonkey's needs in the segmenter. Before the above mention of comparing UTF-8 to UTF-16 in V8, I thought V8's needs were identical to SpiderMonkey's here.

hsivonen avatar Jun 17 '25 17:06 hsivonen

Specialization generally: Avoiding allocations for conversions (particularly when the same string participates in multiple comparisons and in naive glue code would get converted every time) and avoiding the conversion themselves.

I am not suggesting conversion allocations

The main concern with compare<Encoding1, Encoding2>() is getting an identical prefix skipping function that is appropriate for the pair.

Right, the identical prefix skip is firmly in optimization territory, and this is an optimization we can even perform incrementally, if we need to, after writing the API.

Specialization doesn't exist in Rust, but in generic code like this you can still do TypeId::of::<T>() == TypeId::of::<MyEncoding>()-style checks and have them correctly monomorphize.
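
For illustration, a minimal sketch of that pattern (assuming 'static marker types, as TypeId requires):

    use core::any::TypeId;

    struct Utf16Encoding;

    /// After monomorphization the TypeId comparison is a constant, so each
    /// instantiation keeps only the branch it actually uses.
    fn describe<T: 'static>() -> &'static str {
        if TypeId::of::<T>() == TypeId::of::<Utf16Encoding>() {
            "code written specifically for UTF-16"
        } else {
            "generic fallback"
        }
    }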

This is all freedom we retain with a generic API.

My point here isn't the tiny, tiny perf delta between the two but that the more complicated shape should probably be hand-written instead of expecting it to arise automatically from a trait combination.

Which perf delta are you referring to as "tiny"? If it is the one I am talking about, then I don't understand the rest of this sentence: if it is a tiny perf delta why do we want to spend complexity on this?

The perf delta I am talking about is with and without the identical prefix optimization, which I think you are talking about as the complexity. How important is it? Do we have numbers on it? Those numbers can inform us if Spidermonkey, V8, and/or CPython need to care about different combinations.

And as I said above, a generic compare can be written in a way that has some manual dispatch and benefits from identical prefix optimizations. We can have a method where over time we guarantee more and more optimal pairs of encodings, but it does support everything.

we've long ago accepted to accommodate SpiderMonkey's needs in the segmenter

How so? I don't think there's a blanket "we accept arbitrarily optimized APIs to fit Spidermonkey's needs" agreement: we make sure all functionality is exposed, but optimizations can be a different realm where we need to make sure it's worth it.

Manishearth avatar Jun 17 '25 18:06 Manishearth

Concretely, here is how a generic API would work: https://gist.github.com/Manishearth/a6be31bfbd120e0ebfde8d23b94b072e . I've copied in the existing split prefix impls and written a stub compare function, but the general idea is the same.

The core of it is this compare function, which calls a generic split prefix function.

pub fn compare<Encoding1: Encoding, Encoding2: Encoding>(
    left: &Encoding1::Buf,
    right: &Encoding2::Buf,
) -> Ordering {
    // Not using head in this toy impl
    let (_head, left_tail, right_tail) = split_prefix_generic::<Encoding1, Encoding2>(left, right);
    if Encoding1::is_empty(left_tail) && Encoding2::is_empty(right_tail) {
        return Ordering::Equal;
    }

    compare_inner(Encoding1::chars_for(left_tail), Encoding2::chars_for(right_tail))
}

The generic split prefix function does this. All branches in this function are known at compile time:

fn split_prefix_generic<'a, 'b, Encoding1: Encoding, Encoding2: Encoding>(
    left: &'a Encoding1::Buf,
    right: &'b Encoding2::Buf,
) -> (&'a Encoding1::Buf, &'a Encoding1::Buf, &'b Encoding2::Buf) {
    if let (Some(left), Some(right)) = (Encoding1::as_u16(left), Encoding2::as_u16(right)) {
        let (h, l, r) = split_prefix_u16(left, right);
        return (Encoding1::from_u16(h), Encoding1::from_u16(l), Encoding2::from_u16(r));
    } else if let (Some(left), Some(right)) = (Encoding1::as_str(left), Encoding2::as_str(right)) {
        let (h, l, r) = split_prefix_str(left, right);
        return (Encoding1::from_str(h), Encoding1::from_str(l), Encoding2::from_str(r));
    } else if let (Some(left), Some(right)) = (Encoding1::as_u8(left), Encoding2::as_u8(right)) {
        if TypeId::of::<Encoding1>() == TypeId::of::<PIFUtf8>() && TypeId::of::<Encoding2>() == TypeId::of::<PIFUtf8>() {
            let (h, l, r) = split_prefix_u8(left, right);
            return (Encoding1::from_u8(h), Encoding1::from_u8(l), Encoding2::from_u8(r));
        } else if TypeId::of::<Encoding1>() == TypeId::of::<Latin1>() && TypeId::of::<Encoding2>() == TypeId::of::<Latin1>() {
            let (h, l, r) = split_prefix_latin1(left, right);
            return (Encoding1::from_u8(h), Encoding1::from_u8(l), Encoding2::from_u8(r));
        }
    } else if let (Some(left), Some(right)) = (Encoding1::as_u8(left), Encoding2::as_u16(right)) {
        if TypeId::of::<Encoding1>() == TypeId::of::<Latin1>() && TypeId::of::<Encoding2>() == TypeId::of::<Utf16>() {
            let (h, l, r) = split_prefix_latin1_utf16(left, right);
            return (Encoding1::from_u8(h), Encoding1::from_u8(l), Encoding2::from_u16(r));
        }
    }

    return (Encoding1::empty(), left, right)
}

Note that at the moment it will optimize (Latin1, Utf16) but not the other way around; this can be tweaked if we return a (double-ended) chars iterator for the head instead of returning a buffer. The generic compare impl does not need more than that.

I'm happy to try and make a PR for this if desired. I'd like to keep it experimental for now, though.

Manishearth avatar Jun 17 '25 18:06 Manishearth

Which perf delta are you referring to as "tiny"?

The delta between specializing for UCS2 vs. feeding UCS2 input to code that's prepared to handle the superset that is potentially-ill-formed UTF-16.

The perf delta I am talking about is with and without the identical prefix optimization, which I think you are talking about as the complexity. How important is it? Do we have numbers on it? Those numbers can inform us if Spidermonkey, V8, and/or CPython need to care about different combinations.

In #6496 I wrote:

On the existing test suite, the identical prefix optimization itself improves perf from 1% to 12%, generally around 7%. (Additionally, this changeset includes a move avoidance change that improves perf even more.)

The additional bench from https://github.com/unicode-org/icu4x/pull/6495 is improved by around 85%.

The "existing test suite" was about sorting names of persons. The names were such that Latin and Cyrillic didn't really have identical prefixes of interest but Chinese and Korean did.

The additional bench simulated sorting a directory listing of photos named according to datetime taken.

So overall: how good the optimization is depends on the workload, specifically the level of prefix commonality in it. We still don't have actual workloads from actual apps as benches.

I have not measured what the impact would be if the identical prefix skipping weren't optimized on the code unit level and instead worked only on the char level. Operating on the char level has to be slower, so it would be bad to pessimize the already code-unit-level-optimized cases in order to be generic in a way that would look nicer in a public API. (The current pluggability makes a lot of sense for an internal API.)

For comparing UTF-8 with UTF-16, it would make sense to write the identical prefix skipping by decoding both to the char level, though. I suppose that could then be exposed as a generic entry point, but I'm more worried about committing to a generic public API than about having a larger number of named methods behind Cargo features.

A generic public API where the level of generality is that you have to show how to get a DoubleEndedIterator<Item = char> + Clone from the left slice and Iterator<Item = char> + Clone from the right slice would probably not be too scarily general to commit to, though.
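
Spelled out as a sketch (only the shape, not a committed API, and ignoring the slice-to-iterator plumbing):

    use core::cmp::Ordering;

    /// Sketch of the level of generality described above: the caller shows
    /// how to view each input as a clonable stream of chars, with the left
    /// side double-ended so identical-prefix bookkeeping can walk backwards.
    pub fn compare_generic<L, R>(left: L, right: R) -> Ordering
    where
        L: DoubleEndedIterator<Item = char> + Clone,
        R: Iterator<Item = char> + Clone,
    {
        // Placeholder body: a real implementation would skip the identical
        // prefix and then run the collation steps on the remainders. A plain
        // code point comparison just makes the sketch compile.
        left.cmp(right)
    }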

To generalize away the slices to only take two iterators with enough bounds would also involve a bound that would enable truncating the DoubleEndedIterator<Item = char> + Clone so that it only covers the range up to the position of a clone of itself that has been advanced.

Technically, we could also expose a public API where the caller brings the code-unit-level pairwise identical prefix skipping function of the same shape that we now have internally, but as far as correctness review goes, those things look like they belong in the domain expertise-encapsulating library and not to the caller.

It would also be possible to try to generalize the Latin1, UTF-16, UTF-32 hierarchy, but it would just complicate things, when we already have the Latin1 + UTF-16 part done, and adding UTF-32 would involve adding 3 methods of the same shape that we already have the internal macro+generic function infra for.

we've long ago accepted to accommodate SpiderMonkey's needs in the segmenter

How so?

I meant that the Latin1 entry points are there.

hsivonen avatar Jun 17 '25 18:06 hsivonen

return (Encoding1::empty(), left, right)

This is worse than operating on the char level as a last resort.

hsivonen avatar Jun 17 '25 19:06 hsivonen

Note that at the moment it will optimize (Latin1, Utf16) but not the other way around

Since the prefix is taken from left, it's desirable to have the encoding that decodes (backwards) to char more efficiently as the left encoding, and to handle the opposite case by swapping left and right and reversing the returned Ordering.

I can see that your approach could work, though I'd want the fallback case to be the char-level prefix finding that would be needed for comparing UTF-8 with a different encoding anyway, but I'm not currently convinced that the complication is worthwhile to avoid explicitly-named methods generated by the current macro scheme that could be expected to remain behind off-by-default Cargo features.

hsivonen avatar Jun 17 '25 19:06 hsivonen