icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

`Locale`/`LanguageIdentifier` `canonicalize` can return `Cow<str>` instead of String

Open robertbastian opened this issue 3 years ago • 10 comments

robertbastian avatar Oct 15 '22 00:10 robertbastian

I support this change.

zbraniecki avatar Oct 15 '22 13:10 zbraniecki

That said, why does canonicalize return String and not Locale/LanguageIdentifier?

robertbastian avatar Jun 18 '24 21:06 robertbastian

Maybe @zbraniecki can answer that

sffc avatar Jun 19 '24 11:06 sffc

That said, why does canonicalize return String and not Locale/LanguageIdentifier?

canonicalize is a shortcut to &str->canonicalized(&str). You can achieve exactly the same with str.parse().to_string().

I support this being Cow because in case there are no transformations needed this would become non-alloc.

zbraniecki avatar Jun 20 '24 00:06 zbraniecki

While we're at it, can we please rename this function to "normalize_syntax" since that's what it does, as opposed to LocaleCanonicalizer that does actual data driven canonicalization

sffc avatar Jun 20 '24 02:06 sffc

I struggle with the naming here. LDML uses the term canonical for two operations:

  • canonical syntax
  • canonical locale id

This operation secures canonical syntax. I feel normalization is an awkward term for it but it may be familiarity effect.

What I think is more important is that we establish a precedence and use the same name for other scenarios where we're lenient on the input, strict on the output and we normalize stringified representation of things like timezones, etc.

Not going to block normalize_syntax.

zbraniecki avatar Aug 22 '24 21:08 zbraniecki

If it helps, it is the language I used in TimeZoneIdMapper:

https://unicode-org.github.io/icu4x/rustdoc/icu_timezone/struct.TimeZoneIdMapper.html#normalization-vs-canonicalization

"normalize" means to keep the original semantic meaning but massage it into a normalized form, and "canonicalize" means to apply aliases and other changes that could change the semantic meaning, usually by taking something old / obsolete / legacy and making it something modern.

sffc avatar Aug 22 '24 22:08 sffc

I took a look into this and am a bit concerned about the tradeoff here. Here's today's signature:

fn canonicalize<S: AsRef<[u8]>>(input: S) -> Result<String, ParseError>;

The nice thing about this is that it accepts anything that can be refed as [u8].

We can change it to:

fn canonicalize<'s, S: AsRef<[u8]> + 's>(input: S) -> Result<Cow<'s, str>, ParseError>;

but the only value of Cow comes if we can return Borrowed version, but I can't find a way to borrow input as ref, then return it as str.

Do we want to change it to

fn canonicalize<'s>(input: &'s [u8]) -> Result<Cow<'s, str>, ParseError>;
  1. Is there a way to keep AsRef and Cow::Borrowed?
  2. If not, which tradeoff option we prefer?

zbraniecki avatar Sep 17 '24 17:09 zbraniecki

We don't particularly like AsRef<[u8]> inputs anway, and if this blocks borrowing it should be split into canonicalize_utf8(&[u8]) and canonicalize(&str).

robertbastian avatar Sep 18 '24 08:09 robertbastian

Ok, so we're moving the API to:

fn canonicalize_utf8<'s>(input: &'s [u8]) -> Result<Cow<'s, str>, ParseError>;
fn canonicalize<'s>(input: &'s str) -> Result<Cow<'s, str>, ParseError>;

Aligned?

zbraniecki avatar Sep 18 '24 20:09 zbraniecki

Reopening for normalize vs canonicalize discussion, and because the method is still not allocation-free for already-normalized strings (which was the whole point of returning Cow).

robertbastian avatar Oct 25 '24 06:10 robertbastian

We have a method on Locale/LanguageIdentifier called normalizing_eq(str), which does something equivalent to parse + cmp, like canonicalize(str) does parse + to_string. These names should align.

robertbastian avatar Oct 25 '24 08:10 robertbastian

I think the methods should be called "normalize" because they normalize syntax but do not canonicalize subtags.

sffc avatar Oct 26 '24 06:10 sffc

  • @sffc I think Locale::canonicalize should be Locale::normalize
  • @Manishearth We have LocaleCanonicalizer which is a different operation. But "normalize" sounds like NFC.
  • @robertbastian We already had some discussion regarding normalizing_eq.
  • @zbraniecki For me, it's "disagree and commit". I don't think any of the terms are clear.
  • @sffc Regarding the comment from @Manishearth, we already use the word "normalize" toi mean different things across the codebase. For example, ECMA-402 uses it to refer to time zone ID case normalization.
  • @Manishearth I just wanted to highlight the slight degree of confusion. But I think we can move ahead with normalize.
  • @sffc On board with being consistent with "normalize" being "the data doesn't really change" whereas "canonicalize" is "remove legacy stuff".

Conclusion:

  • Locale::normalize and Locale::normalize_utf8
  • Keep Locale::normalizing_eq
  • Same for LanguageIdentifier

LGTM: @manishearth @sffc @robertbastian (@zbraniecki)

robertbastian avatar Oct 31 '24 17:10 robertbastian