Henri Sivonen

Results: 373 comments by Henri Sivonen

If there is interest in interfacing with Python on that level instead of going via UTF-8, I guess that's a use case, then. Note: Python doesn't guarantee UTF-32 validity: The...
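To make the validity point concrete: a buffer of 32-bit units handed over from Python may hold values that are not Unicode scalar values, such as lone surrogates. The following is a minimal std-only sketch, not an ICU4X or PyO3 API; the function name `utf32_lossy` is hypothetical, and mapping invalid units to U+FFFD is just one possible policy.

```rust
/// Hypothetical helper (not part of ICU4X or PyO3): interpret potentially
/// invalid UTF-32 code units as `char`s. Surrogates (U+D800..=U+DFFF) and
/// values above U+10FFFF are replaced with U+FFFD.
fn utf32_lossy(units: &[u32]) -> impl Iterator<Item = char> + '_ {
    units
        .iter()
        // `char::from_u32` returns `None` for anything that is not a
        // Unicode scalar value, so invalid units fall through to U+FFFD.
        .map(|&u| char::from_u32(u).unwrap_or(char::REPLACEMENT_CHARACTER))
}
```

A caller that cannot simply reinterpret `&[u32]` as `&[char]` would need a check of this kind somewhere on the path.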

This should not be taken as an endorsement of UTF-32, but as a matter of how hard things would be for the collator specifically:

> Segmenter and Collator have fine-tuned...

Since most strings don't contain supplementary-plane characters, supporting UTF-32 wouldn't really help: If most Python strings were converted to UTF-32 at the ICU4X API boundary, they might as well be converted...

With the infrastructure from #6674, it would be rather easy to introduce a `pyo3` feature that added `compare_py(&self, left: &PyStringData, right: &PyStringData)` using https://pyo3.rs/main/doc/pyo3/types/enum.pystringdata . Debatable whether ICU4X would be...
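As a rough illustration of what dispatching on Python's string storage could look like, here is a std-only sketch. The `StringData` enum is a local stand-in modeled on the shape of PyO3's `PyStringData` (it is not the PyO3 type), the names `chars_lossy` and `compare_code_points` are hypothetical, and plain code point order stands in for real collation.

```rust
use std::cmp::Ordering;

/// Local stand-in modeled on the shape of PyO3's `PyStringData`
/// (an assumption for illustration; this is not the PyO3 type).
enum StringData<'a> {
    Ucs1(&'a [u8]),
    Ucs2(&'a [u16]),
    Ucs4(&'a [u32]),
}

impl<'a> StringData<'a> {
    /// Iterate as `char`s, replacing invalid units with U+FFFD.
    fn chars_lossy(&self) -> Box<dyn Iterator<Item = char> + 'a> {
        match *self {
            // Latin-1 byte values map directly to U+0000..=U+00FF.
            StringData::Ucs1(s) => Box::new(s.iter().map(|&b| char::from(b))),
            // Decoded as UTF-16, so unpaired surrogates become U+FFFD.
            StringData::Ucs2(s) => Box::new(
                char::decode_utf16(s.iter().copied())
                    .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER)),
            ),
            // 32-bit units that are not scalar values become U+FFFD.
            StringData::Ucs4(s) => Box::new(
                s.iter()
                    .map(|&u| char::from_u32(u).unwrap_or(char::REPLACEMENT_CHARACTER)),
            ),
        }
    }
}

/// Hypothetical `compare_py`-like entry point. Code point order is used
/// here only as a placeholder for a real collator comparison.
fn compare_code_points(left: &StringData<'_>, right: &StringData<'_>) -> Ordering {
    left.chars_lossy().cmp(right.chars_lossy())
}
```

This goes through `char` for every storage width, which is the simple fallback; it says nothing about how specialized per-encoding code paths would perform.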

> Not particularly in favor of an optional pyo3 dep.
>
> Might be possible to do this as one method using generic encoding traits?

The simplest thing in terms...

> Segmenter and Collator have fine-tuned code paths for UTF-8 and UTF-16, so it's not necessarily trivial to add UTF-32 support.

AFAICT, adding UTF-32 support to the segmenter would be...

Supporting the segmenter with PyPy (and GraalPy?) might be more involved, though, if Python semantics require UTF-32 indices to be exposed but the data shown to ICU4X is UTF-8.

> > So to support CPython, if we don't want that micro optimization, we could add a utf32 feature that adds three methods in addition to the ones from [#6674](https://github.com/unicode-org/icu4x/pull/6674):...

> Which perf delta are you referring to as "tiny"?

The delta between specializing for UCS2 vs. feeding UCS2 input to code that's prepared to handle the superset that is...

> `return (Encoding1::empty(), left, right)`

This is worse than operating on the `char` level as last resort.