icu4x
icu4x copied to clipboard
Finalize name and location of UnvalidatedX
Currently we have zerovec::ule::UnvalidatedStr and zerovec::ule::UnvalidatedChar. For a while, we've been meaning to discuss a more final home and name for these types.
There's nothing really zerovec-specific about these types other than zerovec putting their use case more front and center. They are almost as useful for serde as they are for zerovec.
I'm not a huge fan of the "unvalidated" prefix; I would rather we avoid negations.
How about schrodinger::SchrödingerStr? (also re-exported without the diacritic)
Discuss with:
- @Manishearth
- @robertbastian
- @sffc
Optional:
- @echeran
- @skius
I would suggest reading through the "Personal life" section of the Wikipedia article and the references therein before deciding whether to name more stuff after Erwin Schrödinger.
I would suggest reading through the "Personal life" section of the Wikipedia article and the references therein before deciding whether to name more stuff after Erwin Schrödinger.
Thanks for pointing this out, TIL 🤢
I would have also been opposed to the Schrödinger name on the grounds that it's too clever and requires thinking around three corners or knowing what the crate does already. I'd rather use something straightforward like serde_maybe::MaybeStr, however serde-maybe is already taken unfortunately.
Has removing the layer of abstraction and using the raw bytes underneath directly already been discussed? AIUI, we use UnvalidatedX in cases where we just need something comparable to use as a key in a map, or where the validation cost only needs to be paid conditionally - map.get("my_key".to_raw()) instead of .to_unvalidated() seems fine for the former, and the latter could follow Rust APIs like https://doc.rust-lang.org/stable/std/str/fn.from_utf8.html# ?
Previous discussion and background why we want the extra layer: https://github.com/unicode-org/icu4x/issues/2489
Two main reasons for the abstraction:
- We can use serialize_bytes
- We can serialize as a string for readability in JSON
The current semantics of UnvalidatedStr are:
- Serialization: bytes in binary, string or error in human-readable
- Deserialization: bytes in binary, string in human-readable
This is somewhat handy because it means we can rely on an UnvalidatedStr to be the key of a map (or else hit a serializer error). But, another useful semantic for data that is expected to be either string or bytes would be to serialize as a byte sequence in human-readable if the bytes are not ASCII.
A name I've been using on my branch is BytesOrStr (could also be StrOrBytes). In that pattern, just prepend "BytesOr" or append "OrBytes" to the validated type name.
BytesOrStr sounds like it's an enum and that having bytes is actually fine, when in fact it's an error/gigo case.
I agree with @robertbastian that the "Or" is too inclusive here. But that does make me think, StrAsBytes/BytesStr?
Otherwise maybe some non-negated antonyms of "trusted": QuestionableStr, ShadyStr?
Haha I love ShadyStr. SketchyStr could be another option.
My problem with these alternatives is that (imo) they either require more thinking than UnvalidatedStr, e.g., SketchyStr, or they are somewhat ambiguous, e.g., MaybeStr - to me this sounds like the case where the value is not a Str is supported, like the Haskell Maybe (Rust's Option), or is there some precedent for using Maybe to mean Unvalidated?
I kinda like the Unvalidated prefix and don't feel bad about the negation. Could be DeferredStr instead (serdefer makes for a cute crate name in that case). Or LaterStr if you want to use a simpler word.
I do like the zerovec crate being nice and clean, but I do admit that I'm also not too happy about proliferating more util crates. No strong opinion there.
I agree with @Manishearth about the naming, UnvalidatedStr perfectly describes what the type is. If we have to avoid the negation, I like LaterStr the most out of all alternatives discussed in this thread.
BytesOrStrsounds like it's an enum and that having bytes is actually fine, when in fact it's an error/gigo case.
This was a progression from my suggestion that it may be useful to rethink the semantics:
The current semantics of UnvalidatedStr are:
- Serialization: bytes in binary, string or error in human-readable
- Deserialization: bytes in binary, string in human-readable
This is somewhat handy because it means we can rely on an UnvalidatedStr to be the key of a map (or else hit a serializer error). But, another useful semantic for data that is expected to be either string or bytes would be to serialize as a byte sequence in human-readable if the bytes are not ASCII.
In practice, there's very little difference between the functionality of UnvalidatedStr and BytesOrStr except for some edge cases around serialization. So why not just embrace the other model that is more flexible.
The unvalidated name is available. We could use unvalidated::Str and unvalidated::Char
Not opposed.
I still like SchrödingerStr modulo the issues with the namesake historical figure. Can't we come up with some other "clever" term a la elsa::FrozenMap, yoke::Yokeable, ... ?
Should we consider a name like PotentiallyIllFormedUtf8?
Is there a semantic difference?
Can't we come up with some other "clever" term a la elsa::FrozenMap, yoke::Yokeable, ... ?
I really really dislike all the "clever" names.
In this case I can't immediately think of anything and unvalidated is available and accurate, we should just use that imo.
I am not very happy with that proposal because:
- In the OP, I said I would prefer to avoid negated terms like
unordisornon. I prefer to start with a positive term. unvalidated::Strviolates our style guide naming. We don't want to importunvalidated::Strand useStrat call sites; that is super confusing. We could need to either import it asunvalidated::Str as UnvalidatedStror exportunvalidated::UnvalidatedStr.
Bikeshed, with some help from gemini:
RawStrDeferredStrQuasiStrChameleonStrProspectStrOptimisticStrPotentialStr
If we don't like any of those, how about choosing a letter of the alphabet, like
WStrXStrYStrZStr
Most developers don't need to deeply understand why this type exists. They will most often see it as the key of a map. ZeroMap<XStr, PatternULE> seems fairly clear: the key is a string that is potentially a bit weird, and the value is a ULE pattern type.
I'm fine with unvalidated::UnvalidatedStr.
I don't really think any of the other names work well here. I think the un is warranted here.
What did you all think of PotentiallyIllFormedUtf8
"Potentially ill-formed UTF-8" is an established concept in the industry. Actually we already use it in icu_segmenter.
The type name is a bit long, so we could do (bikeshed):
potentially_ill_formed::PifStrpotentially_ill_formed::PillStrpotentially_ill_formed::PIllStr
I also observe that the crate name pill is available. How about pill::PillStr? 😃
"Pill" is aligned with our hospital theme of "ICU"
A potential advantage of this approach is that we could start having APIs such as
pub fn process(s: &str)
pub fn process_utf8(s: &PillStr)
pub fn process_utf16(s: &PillUtf16)
What did you all think of
PotentiallyIllFormedUtf8
Not strongly opposed, feels long. Still prefer the current Unvalidated naming.
Not a fan of pill. I think it's okayish.
Discussion Conclusions:
- This lives in a crate named after the type (probably), that depends on
serdeandzerovec, with feature flags. - Fine with using these for
_utf8and_utf16APIs that accept potentially invalid stuff
LGTM: @manishearth, @sffc, @robertbastian
Name conclusion: bikeshed later (Manish and Rob prefer Unvalidated), also need to bikeshed _utf8 suffix
The crate should also have an optional dependency on writeable as discussed in https://github.com/unicode-org/icu4x/pull/4786#discussion_r1559627673
I really want to start the bikeshed for this. Let's focus just on the dynamically sized type that can be infallibly converted from a byte sequence, fallibly converted to a str, and includes impls for things like serde, zerovec, and writeable.
potentially_ill_formed- 1a
potentially_ill_formed::PotentiallyIllFormedStr,potentially_ill_formed::PotentiallyIllFormedUtf16 - 1b
potentially_ill_formed::PotentiallyIllFormedUtf8,potentially_ill_formed::PotentiallyIllFormedUtf16
- 1a
pill- 2a
pill::PillStr,pill::Pill16 - 2b
pill::PillUtf8,pill::PillUtf16 - 2c
pill::Pill8,pill::Pill16
- 2a
unvalidated- 3a
unvalidated::UnvalidatedStr,unvalidated::UnvalidatedUtf16 - 3b
unvalidated::UnvalidatedUtf8,unvalidated::UnvalidatedUtf16
- 3a
quasistr- 4a
quasistr::QuasiStr,quasistr::Quasi16 - 4b
quasistr::QuasiUtf8,quasistr::QuasiUtf16 - 4c
quasistr::Quasi8,quasistr::Quasi16
- 4a
quasi_str- 5a
quasi_str::QuasiStr,quasi_str::Quasi16 - 5b
quasi_str::QuasiUtf8,quasi_str::QuasiUtf16 - 5b
quasi_str::Quasi8,quasi_str::Quasi16
- 5a
Any more suggestions before I send out a ballot? I will distribute it after this week's ICU4X-WG call.