icu4x
icu4x copied to clipboard
`Locale`/`LanguageIdentifier` `canonicalize` can return `Cow<str>` instead of String
I support this change.
That said, why does canonicalize return String and not Locale/LanguageIdentifier?
Maybe @zbraniecki can answer that
That said, why does canonicalize return String and not Locale/LanguageIdentifier?
canonicalize is a shortcut to &str->canonicalized(&str). You can achieve exactly the same with str.parse().to_string().
I support this being Cow because in case there are no transformations needed this would become non-alloc.
While we're at it, can we please rename this function to "normalize_syntax" since that's what it does, as opposed to LocaleCanonicalizer that does actual data driven canonicalization
I struggle with the naming here. LDML uses the term canonical for two operations:
- canonical syntax
- canonical locale id
This operation secures canonical syntax. I feel normalization is an awkward term for it but it may be familiarity effect.
What I think is more important is that we establish a precedence and use the same name for other scenarios where we're lenient on the input, strict on the output and we normalize stringified representation of things like timezones, etc.
Not going to block normalize_syntax.
If it helps, it is the language I used in TimeZoneIdMapper:
https://unicode-org.github.io/icu4x/rustdoc/icu_timezone/struct.TimeZoneIdMapper.html#normalization-vs-canonicalization
"normalize" means to keep the original semantic meaning but massage it into a normalized form, and "canonicalize" means to apply aliases and other changes that could change the semantic meaning, usually by taking something old / obsolete / legacy and making it something modern.
I took a look into this and am a bit concerned about the tradeoff here. Here's today's signature:
fn canonicalize<S: AsRef<[u8]>>(input: S) -> Result<String, ParseError>;
The nice thing about this is that it accepts anything that can be refed as [u8].
We can change it to:
fn canonicalize<'s, S: AsRef<[u8]> + 's>(input: S) -> Result<Cow<'s, str>, ParseError>;
but the only value of Cow comes if we can return Borrowed version, but I can't find a way to borrow input as ref, then return it as str.
Do we want to change it to
fn canonicalize<'s>(input: &'s [u8]) -> Result<Cow<'s, str>, ParseError>;
- Is there a way to keep AsRef and Cow::Borrowed?
- If not, which tradeoff option we prefer?
We don't particularly like AsRef<[u8]> inputs anway, and if this blocks borrowing it should be split into canonicalize_utf8(&[u8]) and canonicalize(&str).
Ok, so we're moving the API to:
fn canonicalize_utf8<'s>(input: &'s [u8]) -> Result<Cow<'s, str>, ParseError>;
fn canonicalize<'s>(input: &'s str) -> Result<Cow<'s, str>, ParseError>;
Aligned?
Reopening for normalize vs canonicalize discussion, and because the method is still not allocation-free for already-normalized strings (which was the whole point of returning Cow).
We have a method on Locale/LanguageIdentifier called normalizing_eq(str), which does something equivalent to parse + cmp, like canonicalize(str) does parse + to_string. These names should align.
I think the methods should be called "normalize" because they normalize syntax but do not canonicalize subtags.
- @sffc I think
Locale::canonicalizeshould beLocale::normalize - @Manishearth We have
LocaleCanonicalizerwhich is a different operation. But "normalize" sounds like NFC. - @robertbastian We already had some discussion regarding
normalizing_eq. - @zbraniecki For me, it's "disagree and commit". I don't think any of the terms are clear.
- @sffc Regarding the comment from @Manishearth, we already use the word "normalize" toi mean different things across the codebase. For example, ECMA-402 uses it to refer to time zone ID case normalization.
- @Manishearth I just wanted to highlight the slight degree of confusion. But I think we can move ahead with normalize.
- @sffc On board with being consistent with "normalize" being "the data doesn't really change" whereas "canonicalize" is "remove legacy stuff".
Conclusion:
Locale::normalizeandLocale::normalize_utf8- Keep
Locale::normalizing_eq - Same for
LanguageIdentifier
LGTM: @manishearth @sffc @robertbastian (@zbraniecki)