Uplift Utf8Bytes
This adds a Bytes wrapper with a UTF-8 validity invariant, uplifting the type found as:
- bytes_utils::Str (2.4M downloads/month)
- bytestring::ByteString (1.7M downloads/month)
- tungstenite::Utf8Bytes
- pub(crate) http::ByteStr
- axum::extract::ws::Utf8Bytes (itself a wrapper around tungstenite::Utf8Bytes, so as to not depend on tungstenite in the public API)
- pub(crate) h2::hpack::ByteStr
- async_nats::Subject
- and probably elsewhere.
My approach here was to just copy all the API surface from Bytes that could be adapted (as well as a function based on String::from_utf8()), but if this is too big to merge all at once, I could take out some of the methods and add them in a followup PR. I didn't directly copy anything from any of the aforementioned crates, in case that matters wrt licensing, though I did look a little bit at how bytestring phrased their docs.
The Hash impl delegates to str as Hash rather than Bytes as Hash, meaning it's not hash-compatible with Bytes. This differs from http::ByteStr and tungstenite::Utf8Bytes, but it's required, since Utf8Bytes implements Borrow<str> (and so does tungstenite::Utf8Bytes, which means that was probably an oversight).
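To illustrate why Borrow<str> forces the Hash impl to go through str, here is a minimal std-only sketch. Utf8Buf is a hypothetical stand-in that wraps a Vec<u8> instead of Bytes, so it is not the code from this PR; it only shows the contract that HashMap relies on.

```rust
use std::borrow::Borrow;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::str::Utf8Error;

/// Hypothetical stand-in for the proposed type, backed by Vec<u8> rather than Bytes.
#[derive(Clone, PartialEq, Eq)]
pub struct Utf8Buf(Vec<u8>);

impl Utf8Buf {
    /// Validates once at construction, so later accessors can skip the check.
    pub fn from_utf8(bytes: Vec<u8>) -> Result<Self, Utf8Error> {
        std::str::from_utf8(&bytes)?;
        Ok(Utf8Buf(bytes))
    }

    pub fn as_str(&self) -> &str {
        // SAFETY: the only constructor validated the bytes as UTF-8.
        unsafe { std::str::from_utf8_unchecked(&self.0) }
    }
}

// Borrow<str> promises that Hash and Eq agree between Utf8Buf and str.
impl Borrow<str> for Utf8Buf {
    fn borrow(&self) -> &str {
        self.as_str()
    }
}

// Hashing through str (not through the raw bytes) keeps that promise.
impl Hash for Utf8Buf {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.as_str().hash(state)
    }
}

/// Looks up a Utf8Buf key using a plain &str, which HashMap allows via Borrow<str>.
pub fn lookup_by_str() -> Option<u32> {
    let mut map = HashMap::new();
    map.insert(Utf8Buf::from_utf8(b"key".to_vec()).unwrap(), 1u32);
    map.get("key").copied()
}
```

If Hash instead delegated to the byte slice, a lookup by &str could hash differently from the stored key and miss it, since str and [u8] are not required to feed the hasher the same way.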
Wrt the name: Utf8Bytes seemed to me the most straightforward description of what it is, and the axum case is what spurred me to open this. ByteString seems like it could imply that std::string::String isn't composed of bytes (or that it's something like bstr; not necessarily utf8), and BytesString is a bit clunky with the geminated s. Str is simple like Bytes is, but doesn't match up with OsStr or CStr, which are slices and not owned buffers. (Edit: another option is just String, with the idea that it would primarily be referred to as bytes::String. That's probably a bad idea though). However, I'm not dead set on anything, and would be totally fine with renaming it if that's desired.
FWIW I think this is a good idea. Other than tungstenite and http, which have already been mentioned above, I'll list a few more crates that have their own internal ByteStr type or equivalent to drive home the point:
I'm not sure what the procedure is here, but would it be possible to get another review on this?
I think it would be good for bytes to provide this, but we need to get the API perfectly right on first try when adding it to the bytes crate as breaking changes are impossible. It may be better to start out by placing this in a downstream "blessed" crate and moving it into bytes later.
I do think that seems reasonable, but also, almost the whole API proposed here is basically a subset of that of Bytes with s/[u8]/str, plus a few things based on String (from_utf8, from_utf8_unchecked, as_str, into_bytes, TryFrom<Bytes>). I was intentionally trying to innovate as little as possible with this, with the idea that these functions and impls are well established already by either Bytes or std. I also think bytes_utils and bytestring have kinda already functioned as that testing ground - certainly more data could be collected about what needs improving in the API, but I don't think a whole other crate needs to be released for that. As I said, I'm happy to remove some API surface for an initial impl, but I think even just having something with Clone, Deref, from_utf8_unchecked and into_bytes would help the ecosystem start to unify on a single type.
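For a sense of scale, the minimal surface mentioned above (Clone, Deref, from_utf8_unchecked, into_bytes, plus a checked from_utf8) might look roughly like the sketch below. It is std-only and rests on an assumption: SharedBytes is a hypothetical Arc<[u8]> alias standing in for Bytes, so none of this is the actual code from the PR.

```rust
use std::ops::Deref;
use std::str::Utf8Error;
use std::sync::Arc;

/// Hypothetical stand-in for bytes::Bytes: an immutable, cheaply clonable buffer.
pub type SharedBytes = Arc<[u8]>;

/// Sketch of a buffer whose contents are always valid UTF-8.
#[derive(Clone)]
pub struct Utf8Bytes {
    inner: SharedBytes,
}

impl Utf8Bytes {
    /// Checked constructor, modeled on String::from_utf8.
    pub fn from_utf8(bytes: SharedBytes) -> Result<Self, Utf8Error> {
        std::str::from_utf8(&bytes)?;
        Ok(Self { inner: bytes })
    }

    /// Unchecked constructor.
    ///
    /// # Safety
    /// The caller must guarantee that `bytes` is valid UTF-8.
    pub unsafe fn from_utf8_unchecked(bytes: SharedBytes) -> Self {
        Self { inner: bytes }
    }

    /// Gives back the underlying buffer without copying.
    pub fn into_bytes(self) -> SharedBytes {
        self.inner
    }
}

impl Deref for Utf8Bytes {
    type Target = str;

    fn deref(&self) -> &str {
        // SAFETY: every constructor upholds the UTF-8 invariant.
        unsafe { std::str::from_utf8_unchecked(&self.inner) }
    }
}
```

With Deref<Target = str>, all of the &self methods on str (len, split, starts_with, and so on) become available without adding anything else to the wrapper.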
I could buy that argument, but you are innovating in the name of the struct. ;)
To add a counter point, I have "needed" such a struct in several crates, but it hasn't bothered me at all to include the tiny amount of code to make a private struct.
So I haven't felt a need for a unified type in the ecosystem.
I mean, I can see the point that downstream crates would like to avoid using unsafe, which you need to make a nice BytesString wrapper.
To add a counter point, I have "needed" such a struct in several crates, but it hasn't bothered me at all to include the tiny amount of code to make a private struct.
FWIW, I'd generally agree with this - I've not needed this functionality all that often, and when I have I've been content just slapping a dependency on bytestring to fix it. However, that only goes so far, and honestly the main motivator for this was the axum case; the moment you need to export a utf-8-validated-Bytes between crates, it starts to expose the problems with this approach. I mentioned this in https://github.com/tokio-rs/axum/issues/3082, discussing different ways of wrapping Utf8Bytes for axum:
- Make a new axum version of Utf8Bytes
- This would mean a new API surface that's not actually websocket-specific, but actually a weirdly useful general utility for a utf8 Bytes wrapper. Definitely doable, but means there might be further feature requests for str versions of Bytes methods/Bytes versions of str methods, and also that the user will need to do conversions if they're already using a version of this type from something like bytestring or bytes-utils.
I'm running into those issues now in a project; I'm trying to convert between bytestring and axum::extract::ws::Utf8Bytes, but the latter doesn't have a from_utf8_unchecked function; I can open a PR for it, but then it's that treadmill I predicted.
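To make the conversion problem concrete, here is a small std-only sketch. StrA and StrB are hypothetical stand-ins for two crates' wrappers (they are not the bytestring or axum types), and StrB deliberately has no unchecked constructor.

```rust
use std::str::Utf8Error;

/// Hypothetical wrapper from "crate A"; it can give up its validated bytes.
pub struct StrA(Vec<u8>);

/// Hypothetical wrapper from "crate B"; it only offers a checked constructor.
pub struct StrB(Vec<u8>);

impl StrA {
    pub fn from_utf8(bytes: Vec<u8>) -> Result<Self, Utf8Error> {
        std::str::from_utf8(&bytes)?;
        Ok(StrA(bytes))
    }

    pub fn into_bytes(self) -> Vec<u8> {
        self.0
    }
}

impl StrB {
    pub fn from_utf8(bytes: Vec<u8>) -> Result<Self, Utf8Error> {
        std::str::from_utf8(&bytes)?;
        Ok(StrB(bytes))
    }

    pub fn as_str(&self) -> &str {
        // SAFETY: the only constructor validated the bytes as UTF-8.
        unsafe { std::str::from_utf8_unchecked(&self.0) }
    }
}

/// Converts A to B. With no unchecked constructor on StrB, the bytes are
/// scanned a second time even though StrA already guarantees validity.
pub fn a_to_b(a: StrA) -> StrB {
    StrB::from_utf8(a.into_bytes()).expect("StrA already guarantees valid UTF-8")
}
```

The second validation pass and the expect that can never fail are the cost of each crate owning its own type; a shared type would make the conversion unnecessary.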
I could buy that argument, but you are innovating in the name of the struct. ;)
True :upside_down_face: I'm just not sure how a semi-official crate will help bikeshed the name; some sort of RFC to the community that's posted in TWIR might be more likely to help work out kinks in the name/get opinions on it. Or, it'd be probably even better to just be a top-down decision made by the same people who finalized the API names for bytes 1.0.
That's true, I don't think a downstream crate helps on the struct name issue. The name I like is ByteString. Does anyone have strong objections?
What about the module? string? I guess theoretically it could not exist for now, and the error returned from from_utf8 could be a sealed type, but that might be a bit much.
I guess it is worth considering what they are going to do in https://github.com/rust-lang/rust/issues/134915#issuecomment-2899431966.
The name I like is ByteString.
That sounds like "string of bytes", like in that RFC you mentioned or in "bstr" crate.
I would prefer:
- BytesString (plural, meaning string based on Bytes)
- bytes::String
I guess BytesString could also work. As for String ... maybe! But it might be too easy to confuse with String.
I wanted to share support for @coolreader18's original proposal of naming this Utf8Bytes. I think it's the most descriptive (and most concise), and I've seen similar naming schemes with other similar wrappers in the ecosystem, e.g. camino, which provides UTF-8 enforcement of the stdlib's Path types with Utf8Path.