Henri Sivonen comments

Results: 373 comments of Henri Sivonen

Reconsider UTF-32 support

If there is interest in interfacing with Python on that level instead of going via UTF-8, I guess that's a use case, then. Note: Python doesn't guarantee UTF-32 validity: The...
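As a rough illustration of why validity matters at such a boundary: Python strings can contain lone surrogates, and those are not Unicode scalar values, so a buffer of 32-bit units taken from Python cannot simply be reinterpreted as `char`s. The sketch below uses only the Rust standard library; the function names `is_valid_utf32` and `utf32_to_chars_lossy` are hypothetical and are not ICU4X API.

```rust
/// Returns true if every 32-bit unit is a Unicode scalar value,
/// i.e. not a surrogate and not above U+10FFFF.
/// (Hypothetical helper name, for illustration only.)
fn is_valid_utf32(units: &[u32]) -> bool {
    units.iter().all(|&u| char::from_u32(u).is_some())
}

/// Lazily converts 32-bit units to `char`s, substituting U+FFFD
/// for any unit that is not a Unicode scalar value.
/// (Hypothetical helper name, for illustration only.)
fn utf32_to_chars_lossy(units: &[u32]) -> impl Iterator<Item = char> + '_ {
    units
        .iter()
        .map(|&u| char::from_u32(u).unwrap_or(char::REPLACEMENT_CHARACTER))
}
```

Replacing invalid units with U+FFFD is one possible policy; rejecting the input outright is another.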

Reconsider UTF-32 support

This should not be taken as an endorsement of UTF-32, but as a matter of how hard things would be for the collator specifically:

> Segmenter and Collator have fine-tuned...

Since most strings don't contain supplementary-plane characters, supporting UTF-32 wouldn't really help: If most Python strings were converted to UTF-32 at the ICU4X API boundary, they might as well be converted...

Reconsider UTF-32 support

With the infrastructure from #6674, it would be rather easy to introduce a `pyo3` feature that added `compare_py(&self, left: &PyStringData, right: &PyStringData)` using https://pyo3.rs/main/doc/pyo3/types/enum.pystringdata . Debatable whether ICU4X would be...
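To make the shape of such a method concrete, here is a minimal sketch of dispatching over the three storage widths that pyo3's `PyStringData` exposes (`Ucs1`, `Ucs2`, `Ucs4`). To stay self-contained it uses a hypothetical local enum `StrData` instead of the pyo3 type, and it compares by code point order as a stand-in for real collation. Neither `StrData` nor `compare_code_points` is ICU4X or pyo3 API, and an actual collator would not order strings this way.

```rust
use std::cmp::Ordering;

/// Hypothetical local stand-in mirroring the three storage widths
/// of pyo3's `PyStringData`; not the pyo3 type itself.
enum StrData<'a> {
    Ucs1(&'a [u8]),
    Ucs2(&'a [u16]),
    Ucs4(&'a [u32]),
}

/// Maps a code unit to a `char`, substituting U+FFFD for surrogates
/// and out-of-range values.
fn lossy(u: u32) -> char {
    char::from_u32(u).unwrap_or(char::REPLACEMENT_CHARACTER)
}

impl<'a> StrData<'a> {
    /// Yields one `char` per stored unit, whatever the storage width.
    fn chars(&self) -> Box<dyn Iterator<Item = char> + 'a> {
        match *self {
            // Every u8 value is a valid scalar value (Latin-1 range).
            StrData::Ucs1(b) => Box::new(b.iter().map(|&u| char::from(u))),
            StrData::Ucs2(w) => Box::new(w.iter().map(|&u| lossy(u32::from(u)))),
            StrData::Ucs4(d) => Box::new(d.iter().map(|&u| lossy(u))),
        }
    }
}

/// Stand-in comparison: plain code point order, NOT locale-aware collation.
fn compare_code_points(left: &StrData<'_>, right: &StrData<'_>) -> Ordering {
    left.chars().cmp(right.chars())
}
```

Because both sides are reduced to a `char` iterator, the two arguments may use different storage widths, at the cost of dynamic dispatch per character.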

Reconsider UTF-32 support

> Not particularly in favor of an optional pyo3 dep.
>
> Might be possible to do this as one method using generic encoding traits?

The simplest thing in terms...

Reconsider UTF-32 support

> Segmenter and Collator have fine-tuned code paths for UTF-8 and UTF-16, so it's not necessarily trivial to add UTF-32 support.

AFAICT, adding UTF-32 support to the segmenter would be...

Reconsider UTF-32 support

Supporting the segmenter with PyPy (and GraalPy?) might be more involved, though, if Python semantics require UTF-32 indices to be exposed but the data shown to ICU4X is UTF-8.

Reconsider UTF-32 support

> > So to support CPython, if we don't want that micro optimization, we could add a utf32 feature that adds three methods in addition to the ones from [#6674](https://github.com/unicode-org/icu4x/pull/6674):...

Reconsider UTF-32 support

> Which perf delta are you referring to as "tiny"?

The delta between specializing for UCS2 vs. feeding UCS2 input to code that's prepared to handle the superset that is...
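The two options being contrasted can be sketched as follows, assuming that the superset in question is UTF-16 (the snippet above is truncated, so this is an assumption). Both function names are hypothetical and use only the Rust standard library. The first treats each 16-bit unit as a complete code point, which is what a UCS2 specialization can do. The second runs a general UTF-16 decoder that also pairs surrogates.

```rust
/// UCS2-specialized path: each 16-bit unit is one code point.
/// Surrogate units are not scalar values and become U+FFFD.
/// (Hypothetical helper name, for illustration only.)
fn decode_ucs2_lossy(units: &[u16]) -> String {
    units
        .iter()
        .map(|&u| char::from_u32(u32::from(u)).unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}

/// General path: a UTF-16 decoder that pairs surrogates and maps
/// unpaired ones to U+FFFD. Surrogate-free UCS2 input decodes to the
/// same result as the specialized path above.
/// (Hypothetical helper name, for illustration only.)
fn decode_utf16_lossy(units: &[u16]) -> String {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

For input without surrogates the two functions agree, so any difference between them on such input is purely one of performance, which this sketch does not measure.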

Reconsider UTF-32 support

> `return (Encoding1::empty(), left, right)`

This is worse than operating on the `char` level as a last resort.