icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Add `is_normalized_up_to` to Normalizer

Open CanadaHonk opened this issue 2 years ago • 7 comments

Closes #4256. Added UTF8 variant to FFI as is_normalized_up_to. No UTF16 tests or fuzzing yet.

CanadaHonk avatar Nov 20 '23 16:11 CanadaHonk

Added more UTF8 tests.

CanadaHonk avatar Nov 20 '23 16:11 CanadaHonk

Added UTF8 variant to FFI as is_normalized_up_to.

CanadaHonk avatar Nov 20 '23 17:11 CanadaHonk

Added UTF16 variant to FFI coverage allowlist.

CanadaHonk avatar Nov 20 '23 17:11 CanadaHonk

  • @hsivonen - Are we relitigating the decision on well-formedness?
  • @sffc - I don't recall having an explicit decision about that.
  • @sffc - How disruptive would it be to roundtrip ill-formed text?
  • @hsivonen - Tricky in UTF-8, and I feel there isn't much demand for that. In UTF-16, probably more doable. We would want the internal type to be non-char; I'm not sure how disruptive that would be. One option would be to detect non-scalar-values before handing them to the algorithm. But what's the use case of sticking arbitrary binary data in a JavaScript string and then normalizing it as a Unicode string? Someone wrote the simplest possible spec language and then the web reflects ICU4C behavior's of not scrubbing unpaired surrogates. But for us it's simpler to scrub them. Also, when Gecko was switching the style system, we took the risk of not round-tripping unpaired surrages through the CSS object model, and it was fine. I'm skeptical about anyone wanting to both use normalization and do things that would be disruptive if unpaired surrogates were replaced by the replacement character.
  • @sffc - Ok, we don't have anyone clammoring right now for changing unpaired surrogate behavior, so let's not change anything right now, but keep the door open in case someone asks for it.
  • @hsivonen - Okay, so then to the topic in the issue. is_normalized says yes if the string contains unpaired surrogates, but that isn't copy-on-write if is_normalized_up_to stops at one.
  • @sffc - My initial reaction is we should just change is_normalized to have that same behavior (returns false if the string contains unpaired surrogates).
  • @Manishearth - I'm fine with handling ill-formed UTF-16 if someone asks us. On the web, WPT is the place I've encountered it. People write tests for it, but I don't know we see it in the real world. For normalize, we should do something, but for is_normalized, it makes sense to just return false and document it.
  • @hsivonen - Cool, so what should we name the error? I'm inclined to use () as the error.
  • @Manishearth - Ehh, I'm inclined to a named error, but if you're confident there's only going to be one error type... we've done it before. But generally we have named errors, one per crate.
  • @sffc - The standard library has a lot of these unit errors. For example, TryFromCharError. I think we should just do that and make it implement Display.
  • @robertbastian - Struct with no fields seems easier.

Conclusion:

  • Change is_normalized to return false on strings containing unpaired surrogates
  • Make is_normalized_up_to stop when it encounters an unpaired surrogate
  • Make a new unit struct error type in @hsivonen's crate that implements core::fmt::Display

LGTM: @hsivonen @sffc @Manishearth @robertbastian

sffc avatar Nov 30 '23 16:11 sffc

This should also land behind an experimental feature.

robertbastian avatar Mar 07 '24 19:03 robertbastian

This should also land behind an experimental feature.

Why? I thought we resolved in the November 30 discussion minuted above that this can land unconditionally.

hsivonen avatar Mar 22 '24 12:03 hsivonen

That's not clear to me from the minutes, and I don't remember details of this discussion. That said, "stable" in 1.5 doesn't mean much anyway, as we can make breaking changes in 2.0, so it's not that important.

robertbastian avatar Mar 22 '24 12:03 robertbastian