icu4x
icu4x copied to clipboard
Add `is_normalized_up_to` to Normalizer
Closes #4256. Added UTF8 variant to FFI as is_normalized_up_to. No UTF16 tests or fuzzing yet.
Added more UTF8 tests.
Added UTF8 variant to FFI as is_normalized_up_to.
Added UTF16 variant to FFI coverage allowlist.
- @hsivonen - Are we relitigating the decision on well-formedness?
- @sffc - I don't recall having an explicit decision about that.
- @sffc - How disruptive would it be to roundtrip ill-formed text?
- @hsivonen - Tricky in UTF-8, and I feel there isn't much demand for that. In UTF-16, probably more doable. We would want the internal type to be non-
char; I'm not sure how disruptive that would be. One option would be to detect non-scalar-values before handing them to the algorithm. But what's the use case of sticking arbitrary binary data in a JavaScript string and then normalizing it as a Unicode string? Someone wrote the simplest possible spec language and then the web reflects ICU4C behavior's of not scrubbing unpaired surrogates. But for us it's simpler to scrub them. Also, when Gecko was switching the style system, we took the risk of not round-tripping unpaired surrages through the CSS object model, and it was fine. I'm skeptical about anyone wanting to both use normalization and do things that would be disruptive if unpaired surrogates were replaced by the replacement character. - @sffc - Ok, we don't have anyone clammoring right now for changing unpaired surrogate behavior, so let's not change anything right now, but keep the door open in case someone asks for it.
- @hsivonen - Okay, so then to the topic in the issue.
is_normalizedsays yes if the string contains unpaired surrogates, but that isn't copy-on-write ifis_normalized_up_tostops at one. - @sffc - My initial reaction is we should just change
is_normalizedto have that same behavior (returns false if the string contains unpaired surrogates). - @Manishearth - I'm fine with handling ill-formed UTF-16 if someone asks us. On the web, WPT is the place I've encountered it. People write tests for it, but I don't know we see it in the real world. For
normalize, we should do something, but foris_normalized, it makes sense to just return false and document it. - @hsivonen - Cool, so what should we name the error? I'm inclined to use
()as the error. - @Manishearth - Ehh, I'm inclined to a named error, but if you're confident there's only going to be one error type... we've done it before. But generally we have named errors, one per crate.
- @sffc - The standard library has a lot of these unit errors. For example, TryFromCharError. I think we should just do that and make it implement Display.
- @robertbastian - Struct with no fields seems easier.
Conclusion:
- Change
is_normalizedto return false on strings containing unpaired surrogates - Make
is_normalized_up_tostop when it encounters an unpaired surrogate - Make a new unit struct error type in @hsivonen's crate that implements
core::fmt::Display
LGTM: @hsivonen @sffc @Manishearth @robertbastian
This should also land behind an experimental feature.
This should also land behind an experimental feature.
Why? I thought we resolved in the November 30 discussion minuted above that this can land unconditionally.
That's not clear to me from the minutes, and I don't remember details of this discussion. That said, "stable" in 1.5 doesn't mean much anyway, as we can make breaking changes in 2.0, so it's not that important.