icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Finalize name and location of UnvalidatedX

Open sffc opened this issue 2 years ago • 55 comments
trafficstars

Currently we have zerovec::ule::UnvalidatedStr and zerovec::ule::UnvalidatedChar. For a while, we've been meaning to discuss a more final home and name for these types.

There's nothing really zerovec-specific about these types other than zerovec putting their use case more front and center. They are almost as useful for serde as they are for zerovec.

I'm not a huge fan of the "unvalidated" prefix; I would rather we avoid negations.

How about schrodinger::SchrödingerStr? (also re-exported without the diacritic)

Discuss with:

  • @Manishearth
  • @robertbastian
  • @sffc

Optional:

  • @echeran
  • @skius

sffc avatar Jun 19 '23 05:06 sffc

I would suggest reading through the "Personal life" section of the Wikipedia article and the references therein before deciding whether to name more stuff after Erwin Schrödinger.

ajtribick avatar Jun 19 '23 05:06 ajtribick

I would suggest reading through the "Personal life" section of the Wikipedia article and the references therein before deciding whether to name more stuff after Erwin Schrödinger.

Thanks for pointing this out, TIL 🤢

robertbastian avatar Jun 19 '23 08:06 robertbastian

I would have also been opposed to the Schrödinger name on the grounds that it's too clever and requires thinking around three corners or knowing what the crate does already. I'd rather use something straightforward like serde_maybe::MaybeStr, however serde-maybe is already taken unfortunately.

robertbastian avatar Jun 19 '23 08:06 robertbastian

Has removing the layer of abstraction and using the raw bytes underneath directly already been discussed? AIUI, we use UnvalidatedX in cases where we just need something comparable to use as a key in a map, or where the validation cost only needs to be paid conditionally - map.get("my_key".to_raw()) instead of .to_unvalidated() seems fine for the former, and the latter could follow Rust APIs like https://doc.rust-lang.org/stable/std/str/fn.from_utf8.html# ?

skius avatar Jun 19 '23 12:06 skius

Previous discussion and background why we want the extra layer: https://github.com/unicode-org/icu4x/issues/2489

robertbastian avatar Jun 19 '23 12:06 robertbastian

Two main reasons for the abstraction:

  1. We can use serialize_bytes
  2. We can serialize as a string for readability in JSON

sffc avatar Jun 19 '23 16:06 sffc

The current semantics of UnvalidatedStr are:

  • Serialization: bytes in binary, string or error in human-readable
  • Deserialization: bytes in binary, string in human-readable

This is somewhat handy because it means we can rely on an UnvalidatedStr to be the key of a map (or else hit a serializer error). But, another useful semantic for data that is expected to be either string or bytes would be to serialize as a byte sequence in human-readable if the bytes are not ASCII.

sffc avatar Jun 19 '23 20:06 sffc

A name I've been using on my branch is BytesOrStr (could also be StrOrBytes). In that pattern, just prepend "BytesOr" or append "OrBytes" to the validated type name.

sffc avatar Jun 20 '23 05:06 sffc

BytesOrStr sounds like it's an enum and that having bytes is actually fine, when in fact it's an error/gigo case.

robertbastian avatar Jun 20 '23 08:06 robertbastian

I agree with @robertbastian that the "Or" is too inclusive here. But that does make me think, StrAsBytes/BytesStr?

Otherwise maybe some non-negated antonyms of "trusted": QuestionableStr, ShadyStr?

skius avatar Jun 20 '23 08:06 skius

Haha I love ShadyStr. SketchyStr could be another option.

robertbastian avatar Jun 20 '23 08:06 robertbastian

My problem with these alternatives is that (imo) they either require more thinking than UnvalidatedStr, e.g., SketchyStr, or they are somewhat ambiguous, e.g., MaybeStr - to me this sounds like the case where the value is not a Str is supported, like the Haskell Maybe (Rust's Option), or is there some precedent for using Maybe to mean Unvalidated?

skius avatar Jun 20 '23 09:06 skius

I kinda like the Unvalidated prefix and don't feel bad about the negation. Could be DeferredStr instead (serdefer makes for a cute crate name in that case). Or LaterStr if you want to use a simpler word.

I do like the zerovec crate being nice and clean, but I do admit that I'm also not too happy about proliferating more util crates. No strong opinion there.

Manishearth avatar Jun 20 '23 09:06 Manishearth

I agree with @Manishearth about the naming, UnvalidatedStr perfectly describes what the type is. If we have to avoid the negation, I like LaterStr the most out of all alternatives discussed in this thread.

skius avatar Jun 20 '23 10:06 skius

BytesOrStr sounds like it's an enum and that having bytes is actually fine, when in fact it's an error/gigo case.

This was a progression from my suggestion that it may be useful to rethink the semantics:

The current semantics of UnvalidatedStr are:

  • Serialization: bytes in binary, string or error in human-readable
  • Deserialization: bytes in binary, string in human-readable

This is somewhat handy because it means we can rely on an UnvalidatedStr to be the key of a map (or else hit a serializer error). But, another useful semantic for data that is expected to be either string or bytes would be to serialize as a byte sequence in human-readable if the bytes are not ASCII.

In practice, there's very little difference between the functionality of UnvalidatedStr and BytesOrStr except for some edge cases around serialization. So why not just embrace the other model that is more flexible.

sffc avatar Jun 20 '23 19:06 sffc

The unvalidated name is available. We could use unvalidated::Str and unvalidated::Char

robertbastian avatar Feb 28 '24 15:02 robertbastian

Not opposed.

Manishearth avatar Feb 28 '24 16:02 Manishearth

I still like SchrödingerStr modulo the issues with the namesake historical figure. Can't we come up with some other "clever" term a la elsa::FrozenMap, yoke::Yokeable, ... ?

sffc avatar Feb 29 '24 06:02 sffc

Should we consider a name like PotentiallyIllFormedUtf8?

Is there a semantic difference?

sffc avatar Feb 29 '24 06:02 sffc

Can't we come up with some other "clever" term a la elsa::FrozenMap, yoke::Yokeable, ... ?

I really really dislike all the "clever" names.

robertbastian avatar Feb 29 '24 10:02 robertbastian

In this case I can't immediately think of anything and unvalidated is available and accurate, we should just use that imo.

Manishearth avatar Feb 29 '24 16:02 Manishearth

I am not very happy with that proposal because:

  1. In the OP, I said I would prefer to avoid negated terms like un or dis or non. I prefer to start with a positive term.
  2. unvalidated::Str violates our style guide naming. We don't want to import unvalidated::Str and use Str at call sites; that is super confusing. We could need to either import it as unvalidated::Str as UnvalidatedStr or export unvalidated::UnvalidatedStr.

sffc avatar Feb 29 '24 17:02 sffc

Bikeshed, with some help from gemini:

  • RawStr
  • DeferredStr
  • QuasiStr
  • ChameleonStr
  • ProspectStr
  • OptimisticStr
  • PotentialStr

If we don't like any of those, how about choosing a letter of the alphabet, like

  • WStr
  • XStr
  • YStr
  • ZStr

Most developers don't need to deeply understand why this type exists. They will most often see it as the key of a map. ZeroMap<XStr, PatternULE> seems fairly clear: the key is a string that is potentially a bit weird, and the value is a ULE pattern type.

sffc avatar Feb 29 '24 17:02 sffc

I'm fine with unvalidated::UnvalidatedStr.

I don't really think any of the other names work well here. I think the un is warranted here.

Manishearth avatar Feb 29 '24 18:02 Manishearth

What did you all think of PotentiallyIllFormedUtf8

"Potentially ill-formed UTF-8" is an established concept in the industry. Actually we already use it in icu_segmenter.

The type name is a bit long, so we could do (bikeshed):

  • potentially_ill_formed::PifStr
  • potentially_ill_formed::PillStr
  • potentially_ill_formed::PIllStr

I also observe that the crate name pill is available. How about pill::PillStr? 😃

"Pill" is aligned with our hospital theme of "ICU"

sffc avatar Feb 29 '24 20:02 sffc

A potential advantage of this approach is that we could start having APIs such as

pub fn process(s: &str)
pub fn process_utf8(s: &PillStr)
pub fn process_utf16(s: &PillUtf16)

sffc avatar Feb 29 '24 20:02 sffc

What did you all think of PotentiallyIllFormedUtf8

Not strongly opposed, feels long. Still prefer the current Unvalidated naming.

Not a fan of pill. I think it's okayish.

Manishearth avatar Feb 29 '24 21:02 Manishearth

Discussion Conclusions:

  1. This lives in a crate named after the type (probably), that depends on serde and zerovec, with feature flags.
  2. Fine with using these for _utf8 and _utf16 APIs that accept potentially invalid stuff

LGTM: @manishearth, @sffc, @robertbastian

Name conclusion: bikeshed later (Manish and Rob prefer Unvalidated), also need to bikeshed _utf8 suffix

Manishearth avatar Mar 15 '24 14:03 Manishearth

The crate should also have an optional dependency on writeable as discussed in https://github.com/unicode-org/icu4x/pull/4786#discussion_r1559627673

sffc avatar Apr 10 '24 16:04 sffc

I really want to start the bikeshed for this. Let's focus just on the dynamically sized type that can be infallibly converted from a byte sequence, fallibly converted to a str, and includes impls for things like serde, zerovec, and writeable.

  1. potentially_ill_formed
    • 1a potentially_ill_formed::PotentiallyIllFormedStr, potentially_ill_formed::PotentiallyIllFormedUtf16
    • 1b potentially_ill_formed::PotentiallyIllFormedUtf8, potentially_ill_formed::PotentiallyIllFormedUtf16
  2. pill
    • 2a pill::PillStr, pill::Pill16
    • 2b pill::PillUtf8, pill::PillUtf16
    • 2c pill::Pill8, pill::Pill16
  3. unvalidated
    • 3a unvalidated::UnvalidatedStr, unvalidated::UnvalidatedUtf16
    • 3b unvalidated::UnvalidatedUtf8, unvalidated::UnvalidatedUtf16
  4. quasistr
    • 4a quasistr::QuasiStr, quasistr::Quasi16
    • 4b quasistr::QuasiUtf8, quasistr::QuasiUtf16
    • 4c quasistr::Quasi8, quasistr::Quasi16
  5. quasi_str
    • 5a quasi_str::QuasiStr, quasi_str::Quasi16
    • 5b quasi_str::QuasiUtf8, quasi_str::QuasiUtf16
    • 5b quasi_str::Quasi8, quasi_str::Quasi16

Any more suggestions before I send out a ballot? I will distribute it after this week's ICU4X-WG call.

sffc avatar Apr 10 '24 16:04 sffc