icu4x `Locale`/`LanguageIdentifier` `canonicalize` can return `Cow<str>` instead of String

Oct 15 '22 00:10 robertbastian

I support this change.

Oct 15 '22 13:10 zbraniecki

That said, why does canonicalize return String and not Locale/LanguageIdentifier?

Jun 18 '24 21:06 robertbastian

Maybe @zbraniecki can answer that

Jun 19 '24 11:06 sffc

That said, why does canonicalize return String and not Locale/LanguageIdentifier?

canonicalize is a shortcut to &str->canonicalized(&str). You can achieve exactly the same with str.parse().to_string().

I support this being Cow because in case there are no transformations needed this would become non-alloc.

Jun 20 '24 00:06 zbraniecki

While we're at it, can we please rename this function to "normalize_syntax" since that's what it does, as opposed to LocaleCanonicalizer that does actual data driven canonicalization

Jun 20 '24 02:06 sffc

I struggle with the naming here. LDML uses the term canonical for two operations:

canonical syntax
canonical locale id

This operation secures canonical syntax. I feel normalization is an awkward term for it but it may be familiarity effect.

What I think is more important is that we establish a precedence and use the same name for other scenarios where we're lenient on the input, strict on the output and we normalize stringified representation of things like timezones, etc.

Not going to block normalize_syntax.

Aug 22 '24 21:08 zbraniecki

If it helps, it is the language I used in TimeZoneIdMapper:

https://unicode-org.github.io/icu4x/rustdoc/icu_timezone/struct.TimeZoneIdMapper.html#normalization-vs-canonicalization

"normalize" means to keep the original semantic meaning but massage it into a normalized form, and "canonicalize" means to apply aliases and other changes that could change the semantic meaning, usually by taking something old / obsolete / legacy and making it something modern.

Aug 22 '24 22:08 sffc

I took a look into this and am a bit concerned about the tradeoff here. Here's today's signature:

fn canonicalize<S: AsRef<[u8]>>(input: S) -> Result<String, ParseError>;

The nice thing about this is that it accepts anything that can be refed as [u8].

We can change it to:

fn canonicalize<'s, S: AsRef<[u8]> + 's>(input: S) -> Result<Cow<'s, str>, ParseError>;

but the only value of Cow comes if we can return Borrowed version, but I can't find a way to borrow input as ref, then return it as str.

Do we want to change it to

fn canonicalize<'s>(input: &'s [u8]) -> Result<Cow<'s, str>, ParseError>;

Is there a way to keep AsRef and Cow::Borrowed?
If not, which tradeoff option we prefer?

Sep 17 '24 17:09 zbraniecki

We don't particularly like AsRef<[u8]> inputs anway, and if this blocks borrowing it should be split into canonicalize_utf8(&[u8]) and canonicalize(&str).

Sep 18 '24 08:09 robertbastian

Ok, so we're moving the API to:

fn canonicalize_utf8<'s>(input: &'s [u8]) -> Result<Cow<'s, str>, ParseError>;
fn canonicalize<'s>(input: &'s str) -> Result<Cow<'s, str>, ParseError>;

Aligned?

Sep 18 '24 20:09 zbraniecki

Reopening for normalize vs canonicalize discussion, and because the method is still not allocation-free for already-normalized strings (which was the whole point of returning Cow).

Oct 25 '24 06:10 robertbastian

We have a method on Locale/LanguageIdentifier called normalizing_eq(str), which does something equivalent to parse + cmp, like canonicalize(str) does parse + to_string. These names should align.

Oct 25 '24 08:10 robertbastian

I think the methods should be called "normalize" because they normalize syntax but do not canonicalize subtags.

Oct 26 '24 06:10 sffc

@sffc I think Locale::canonicalize should be Locale::normalize
@Manishearth We have LocaleCanonicalizer which is a different operation. But "normalize" sounds like NFC.
@robertbastian We already had some discussion regarding normalizing_eq.
@zbraniecki For me, it's "disagree and commit". I don't think any of the terms are clear.
@sffc Regarding the comment from @Manishearth, we already use the word "normalize" toi mean different things across the codebase. For example, ECMA-402 uses it to refer to time zone ID case normalization.
@Manishearth I just wanted to highlight the slight degree of confusion. But I think we can move ahead with normalize.
@sffc On board with being consistent with "normalize" being "the data doesn't really change" whereas "canonicalize" is "remove legacy stuff".

Conclusion:

Locale::normalize and Locale::normalize_utf8
Keep Locale::normalizing_eq
Same for LanguageIdentifier

LGTM: @manishearth @sffc @robertbastian (@zbraniecki)

Oct 31 '24 17:10 robertbastian

icu4x icu4x copied to clipboard

`Locale`/`LanguageIdentifier` `canonicalize` can return `Cow<str>` instead of String

icu4x
icu4x copied to clipboard