bstr
                                
                                 bstr copied to clipboard
                                
                                    bstr copied to clipboard
                            
                            
                            
                        RFC: 1.0 release?
For those coming here that don't know what bstr is: it is a string library for &[u8]. The essential difference between the strings in bstr and the &str type in std is that bstr treats &[u8] as conventionally UTF-8 instead of requiring that it be UTF-8. Its main utility is in contexts where you believe your data is UTF-8 (but it might not be completely UTF-8) and you either don't have any information about what its actual encoding is or do not want to pay for the UTF-8 validity check. A common example of this is reading data from files. The bstr documentation says a lot more.
This issue is about releasing 1.0. Since I do not currently have any plans for a 2.0, I would like to get as many eyes on this as possible. If you have any feedback with respect to API breaking changes, I would love to hear about it.
OK, so I promise that the 1.0 release is imminent. Here is an exhaustive list of planned breaking API changes, all of which are currently present on master (brought in via #104):
- Bytes::as_sliceis renamed to- Bytes::as_bytes.
- ByteVec::into_os_stringnow returns- Result<OsString, FromUtf8Error>instead of- Result<OsString, Vec<u8>>.
- ByteVec::into_path_bufnow returns- Result<PathBuf, FromUtf8Error>instead of- Result<PathBuf, Vec<u8>>.
- Find<'a>has been changed to- Find<'h, 'n>, which represents the lifetimes of both the haystack and the needle, instead of the shorter of the two.
- FindReverse<'a>has been changed to- FindReverse<'h, 'n>, which represents the lifetimes of both the haystack and the needle, instead of the shorter of the two.
- Split<'a>has been changed to- Split<h, 's>, which represents the lifetimes of both the haystack and the splitter, instead of the shorter of the two.
- SplitReverse<'a>has been changed to- SplitReverse<'h, 's>, which represents the lifetimes of both the haystack and the splitter, instead of the shorter of the two.
- SplitN<'a>has been changed to- SplitN<h, 's>, which represents the lifetimes of both the haystack and the splitter, instead of the shorter of the two.
- SplitNReverse<'a>has been changed to- SplitNReverse<h, 's>, which represents the lifetimes of both the haystack and the splitter, instead of the shorter of the two.
- ByteSlice::fieldsis now gated behind the- unicodefeature. Previously, it was available unconditionally.
- serde1has been renamed to- serde1-std, and- serde1-nostdhas been split into- serde1-allocand- serde1-core.
- BufReadExt::for_byte_linenow accepts- &mut selfinstead of- self.
- BufReadExt::for_byte_recordnow accepts- &mut selfinstead of- self.
- BufReadExt::for_byte_line_with_terminatornow accepts- &mut selfinstead of- self.
- BufReadExt::for_byte_record_with_terminatornow accepts- &mut selfinstead of- self.
Assuming this is an exhaustive list, and given that these are all very minor changes, I'm hopefully that migration from 0.2 to 1.0 will be very easy. Hopefully requiring no changes in most cases.
My plan is to release 1.0 on ~July 11, 2022~ July 18, 2022. If you have feedback to give, please do so now. :-)
Note that I've published 1.0.0-pre.1 to crates.io. It's not on docs.rs yet though.
(Below is the message I initial wrote a couple years ago. I'm way late to the party.)
It has been almost a year since I released bstr 0.2 with the major breaking change of moving most of the routines to extension traits. It seems like this has been a success. Namely, bstr's reverse dependency list is growing. More generally, I personally like working with the new API better than the old one. While I still bemoan the loss of a distinct type and its corresponding Debug impl as the One True Byte String, I think the benefits of the extension trait API have ended up outweighing that cost.
I've been giving some thought to bstr's API and its future evolution, and nothing immediately comes to mind in terms of breaking changes. That is, most everything I can think of are API additions, or at worst, deprecations. The only breaking change I can think of is to more carefully audit which API routines are available when unicode mode is disabled. I just want to make sure I'm not boxing myself into any corners there. (e.g., Some extant implementations might currently rely on std for its Unicode support, but it may wind up being the case that we want to re-implement some of those, which will require bringing in our own Unicode data. If that occurs, then those APIs should be gated behind the unicode feature.)
Otherwise, my feeling is that, unless I hear otherwise, I will make a 1.0 release in a few months. June 2020 is the 1 year anniversay of the 0.2 release, so that sounds like a good a time as any.
Thoughts?
cc @thomcc I know you've done some work on bstr and are actually using it, so would definitely appreciate if you have any thoughts here! Mostly what I'm looking for are things that we might want to do that will break the current API. While 1.0 doesn't necessarily mean "breaking changes must stop," I generally try to commit to a long termish period of stability for each major version in core libraries.
My main question would be: do the BStr and BString types pull their own weight? They feel a bit redundant to me -- I essentially only ever use the former and even then, only as an argument in a fmt call, and that could just be a one-off wrapper returned by a method on the extension trait.
That said, part of the reason I don't use them much has been due to the fact that the library is pre-1.0 and these seem the most likely item to be removed to me.
In principal I like the idea of being able to specify something is a BString rather than a Vec<u8> -- in compound types like Vec<Vec<u8>> it can be a bit non-obvious what this represents, whereas Vec<BString> is much more clear (to me, anyway).
I think the main other changes I can think of would be API additions, and thus compatible.
do the BStr and BString types pull their own weight? They feel a bit redundant to me -- I essentially only ever use the former and even then, only as an argument in a fmt call, and that could just be a one-off wrapper returned by a method on the extension trait.
I think so. There are two main use cases I envision for them:
- They provide a target for trait impls that is distinct from &[u8]andVec<u8>. This is significant because the latter may be covered by blanket impls. (This is whyserde_bytesexists. One can usebstrinstead ofserde_bytessincebstrhas optional serde support.
- You can use BStringand&BStrtypes internally instead ofVec<u8>and&[u8]to automatically benefit from theirstd::fmt::Debugimpls. That way, you don't have to write custom debug impls.
Yep, that all makes sense to me, and I hadn't considered point 1 at all.
Another place that BString and &BStr are useful is to wrap bytes before using them in assert_eq! in tests to get more readable debug output.
Now that gitoxide is thinking about stability as well it became clear that without a 1.0 bstr release this couldn't ever happen. I am glad to see this is actively explored here and will be looking forward to the eventual 1.0 release.
For reference, not using &BStr in gitoxide would be a great degradation of usability of parsed git objects which can't be forced into UTF-8 encoding. Turning these back into &[u8] would be something so regrettable that it simply has to be avoided.
For reference, not using &BStr in gitoxide would be a great degradation of usability of parsed git objects which can't be forced into UTF-8 encoding.
Could you say more about this? In particular, I would like to hear more about its importance in public APIs.
A good example for its use is a freshly parsed commit whose memory is backed by a buffer somewhere else. The goal is to make clear that these fields represent user-readable strings, but in Rust there is no way to do this without enforcing an encoding or a custom type. That custom type is BStr and once I found it there was no way I would roll my own.
It also works nicely for the mutable version of such a commit.
When testing, working with BString/BStr felt natural and worked exactly as if one would expect.
Note that I put the bstr crate into public APIs knowing it goes against the prominent advice provided in the readme file, that's how indispensable it was 😅.
From there it only proliferated and it's now used in no less than 13 gitoxide crates.
Could you say more about this? In particular, I would like to hear more about its importance in public APIs.
To find a more concise answer: bstr is just too practical and there is a huge void for strings without known encoding in Rust that it seems to fill well for my use.
@Byron Ah wow, that is great feedback. I agree that using BStr/BString in those APIs is probably the right thing, because otherwise the derived Debug representation of those types is going to be totally useless I imagine.
I don't think there is much work to be done to get bstr to 1.0. I'll do my best to prioritize it in the next month or so.
When do you plan to release 1.0 versions of crates that depend on bstr?
Thanks so much! Part of me was afraid I'd be told it's all wrong as it was so strongly discouraged 😅.
Also thanks for the prioritization, even though I think no matter how long it takes you, I will be slower. If it happens at the end of the year I would be served well already.
Ah perfect. My time is super limited and devoting pretty much all of what little free time I have to regex. But I'll do my best to take a detour and whip bstr into a 1.0 release. I think it has had more than enough time to bake and there don't appear to be any obvious design flaws when compared to alternatives.
Having this stable would also enable its use with the riot-wrappers crate, where for example process names are exposed (they're by all probability ASCII even, but an unsafe assume-it-is opens up for UB if a C user on the same system does something weird, and converting them to &str means pulling in UTF-8 checks, and either fallible operations or panics).
No hurry from me (I'll need to wait with this for my next API bump anyway), just a data point, and thanks for working on this!
Since there hasn't been a lot of movement in the past 9 months and since bstr is a very, very important dependency for gitoxide and probably countless other crates, I wonder what I can do to help. The reason for me getting a bit more pushy is the ongoing work to get gitoxide into cargo.
I am hereby happily offering to step in as collaborator or maintainer and serve as helping hand for whatever @BurntSushi deems right.
Thank a lot :)!
Folks, I've updated the top comment of this issue to include a list of breaking changes (all of which are minor). My plan is to push out the 1.0 release on Monday, July 11, 2022. So please, if you have comments or feedback or other ideas for breaking changes, now is the time to do it.
I am hereby happily offering to step in as collaborator or maintainer and serve as helping hand for whatever @BurntSushi deems right.
For what it's worth, it would have probably been an order of magnitude more work to onboard someone else to do the 1.0 release than to just do it myself. (Which ended up taking almost a full day of work.) It's important to remember this when offering to maintain a project. It's not as simple as me just clicking a button and giving someone permissions. :-)
For what it's worth, it would have probably been an order of magnitude more work to onboard someone else to do the 1.0 release than to just do it myself.
That's certainly true, yet I find it worth clarifying that my offer wasn't limited to the 1.0 release. In any case, thanks for this wonderful crate and the continued maintenance efforts!
I sincerely hope that the version of bstr after 1.0 is simply std (or, more precisely, core and alloc as appropriate).
Thank you for such a fundamental building block of the ecosystem.
Note that I've published 1.0.0-pre.1 to crates.io. It's not on docs.rs yet though.
Might as well ask here: Why does for_byte_record_with_terminator in BufReadExt consume self / the underlying BufRead? It doesn't seem to be necessary to me.
Indeed, this is I believe a good point. In fact, all of the methods on BufReadExt consume self. I believe the only ones we actually want to consume self are byte_lines and byte_records, since they each return an iterator generic over Self. But the other methods all execute closures, and so I suspect &mut self would be more appropriate there. (It's likely this isn't a show-stopping issue in practice since you can probably always convert your T: BufReadExt to a &mut T: BufReadExt and get away with calling the for_ methods without actually consuming your reader, but it does look like a wart.)
I read and understand you want bstr to allow for more possibilities around bytes: -string of bytes conforming to utf-8 -string of bytes non-conforming to utf-8, but it's encoding is yet to be determined.
Yesterday, an important and lengthy cloud backup failed to complete. Why? Because the name of either the file or directory contained MACOS emojis and the backup's destination service does not like those kinds of characters(non-conformant). It's related because the file system on MACOS/UNIX/BSD accepts those character ranges be it visible or invisible. Linux xfs/btrfs/ext4 don't have any issues with those character ranges either.
Here is short specific example of an error caused by non-conformance. It may not seem like a big thing, but it is considering it's for a company's backup of data and it delays the backup for the entire set of data. Human intervention is necessary at this point.
rsync --archive someSourceDir/ /mnt/tapedrive/someDestinationDir/
within some subdirectory however there is a file called "blah:foo" notice the colon in there. The backup continues for hours then errors out on the ':' character and does not backup the file in question. If it's a directory name with a colon, it does not backup the directory in question. In other words it complicates matters. Why is that? LTFS formatted tape file systems don't like that ':' colon character in file/directory names. So the use case scenario with an emoji in a filename or directory name brought up another error but this time not on tape backups, but on cloud backups. These are big deals on traditional UNIX-based file systems, but it's a big deal on anything that isn't. I'm guessing AWS S3 and other cloud services have issues like this.
The above brings me to suggest these in the bstr API for helping determine if filenames and directory names are going to be ok before they land on said backup storage destination: bstr::Is_LTFS_Conformant() bstr::Is_S3_Conformant() bstr::Is_Google_Cloud_BitbucketName_Conformant() bstr::Is_Optical_UDF_FS_Conformant()
Oddly enough the only way I've found to circumvent these issues is by tar'ing someSourceDir and then rsync'ing the tar to said destination backup storage, but that's not always the ideal situation. It would be better to find those non-conformant filenames/directory names beforehand and propose some auto-correcting measures to conform to the respective storage range of characters and such.
Why am I mentioning it here? It's because strings, bstr, path are the usual go-to types for holding filenames and directory names. It would be good to have such helper tools available to help everybody save time when doing their own backups without experiencing any errors.
Also to further prevent these kinds of interoperability issues, why not place in the filesystem drivers new constraints that disallow emojis and other unwanted character ranges as filenames and directory names from the beginning. i.e. the openfile/createfile would disallow ':', emojis and such on every operating system. Get all the operating system and storage providers to collaborate on this once and for all.
Thank you for listening. Cheers.
@omac777 Let's please try to keep this thread focused on 1.0 concerns. This isn't an RFC for any arbitrary feature requests, but issues specific to a 1.0 release. I've moved your comment into #109 and responded there.
OK, the docs for what's currently on master have been published: https://docs.rs/bstr/1.0.0-pre.1/bstr/
(I haven't made the changes to stop consuming self for some of the methods on BufReadExt yet.)
I think helpful to address would be issues that could potentially lead to breaking changes before 1.0. Not truly understanding if this is the case or not with what I am about to state, let me share one convenience issue in particular: I am having trouble sometimes to create a BStr instance, usually in situations where type inference bails out preventing "string".into() to work.
Recently I saw code that had to do [].as_bstr() to get an empty &BStr because type inference wouldn't know that a BStr was required. Maybe it's easy enough to have a constructor for empty BStr, and also a function to create new BStr instances directly, like bstr::BStr::empty() or bstr::BStr::new("hello"), the latter similar to std::path::Path::new(…).
@Byron I think these are the related issues for that, right? #84 and #86. I specifically skipped over those because I do not believe there are any breaking changes required to fix those things. But maybe I'm wrong.
Basically, when I created BString and BStr, I specifically tried to leave those types as devoid of methods as I could, since their principle purpose is to act as a Deref point for Vec<u8> and &[u8], respectively, and as a target for trait impls that want "byte string" specific behavior (like std::fmt::Debug, serde::Serialize and serde::Deserialize). Adding more methods on to them makes the deref situation a little more dicey. But if one can convince me that it's okay, then I'd generally be happy to add methods or constructors to them. To be honest, I just haven't done the investigation required for that myself.
One possible "idiom" here is that I believe B("").as_bstr() will always work. But if we can convince ourselves that a BStr::new is OK (with basically the same signature as B, except it returns a &BStr), then I'd be more than happy to add that. Thankfully, it's not a breaking change though.
@Byron I think these are the related issues for that, right? https://github.com/BurntSushi/bstr/issues/84 and https://github.com/BurntSushi/bstr/issues/86. I specifically skipped over those because I do not believe there are any breaking changes required to fix those things. But maybe I'm wrong.
#84 is definitely related, maybe having other ways to create a &BStr would reduce some pressure there.
But if one can convince me that it's okay, then I'd generally be happy to add methods or constructors to them. To be honest, I just haven't done the investigation required for that myself.
Thanks for your openness! It's certainly not the case here either as I am far from an actual investigation, all I have is nearly every-day usage and some returning pains. I think I was struck by fear that fixing any of these would lead to another breaking change which better happens soon then.
One possible "idiom" here is that I believe B("").as_bstr() will always work. But if we can convince ourselves that a BStr::new is OK (with basically the same signature as B, except it returns a &BStr), then I'd be more than happy to add that.
BStr::new() seems like the most intuitive, and for now it doesn't appear to conflict with the Deref targets either. I'd be willing to deal with it when it happens, and take the risk, for more convenience now. It's certainly a small thing as well, I found myself creating little helper methods for that throughout the codebase.
I found myself creating little helper methods for that throughout the codebase.
Can you create a new issue with these? That is, just copy and paste your helper routines. (Or if they fit better in an existing issue, that's fine too.)
I've added another breaking change: restructure serde feature flags
Now that bstr has an 'alloc' feature, we need to rethink how we setup the serde feature flags. Previously, all we had was 'std' and 'no std'. But now we have 'std', 'alloc only' and 'core only'. In particular, 'no std' is split into 'alloc only' and 'core only', since neither one bring in std. To reflect this trichotomy, we rename 'serde1' to 'serde1-std', and split 'serde1-nostd' into 'serde1-alloc' and 'serde1-core'.
I've added another breaking change: several methods on BufReadExt now take &mut self instead of self.
Now that bstr has an 'alloc' feature, we need to rethink how we setup the serde feature flags. Previously, all we had was 'std' and 'no std'. But now we have 'std', 'alloc only' and 'core only'. In particular, 'no std' is split into 'alloc only' and 'core only', since neither one bring in std. To reflect this trichotomy, we rename 'serde1' to 'serde1-std', and split 'serde1-nostd' into 'serde1-alloc' and 'serde1-core'.
Would it be worth bumping the MSRV to 1.60 and use weak / namespaced feature flags?
This way you have
- serde1could possibly be named- serde
- stddepends on- serde?/std
- allocdepends on- serde?/alloc
- Optional dependencies that are not meant to be features can be properly hidden with dep:
@epage Yeah I mused about this here: https://github.com/BurntSushi/bstr/pull/111#issuecomment-1180477086
Basically, I kind of feel like 1.60 is really way too new. My first approximation is to track Debian stable (which is at Rust 1.48), but in principle, I'm okay with something newer.
Unfortunately, there does seem to be a fundamental conflict here. Namely, if we move forward with the existing trichotomy of serde features, then it's going to have to stay that way since I think moving to the scheme you propose would be a breaking change. (Not the MSRV bump, that's fine, but changing the features around.) Since I don't see myself putting out a bstr 2.0 any time soon, it would effectively freeze bstr into a slightly sub-optimal feature configuration. But if I do adopt the nicer feature setup, then I'm being somewhat aggressive with the MSRV.
I personally don't mind being a bit aggressive, but I suppose it depends on my users.
@lopopolo @thomcc @Byron @TethysSvensson How do you feel about Rust 1.60 as the MSRV?
With respect to MSRV, if Rust 1.60 is too new for some folks, then I suppose they could stick with bstr 0.2 until Rust 1.60 is no longer too new.
Given there's a question open here, I'm going to delay the 1.0 release another week.
Speaking of
ByteVec::into_os_string now returns Result<OsString, FromUtf8Error> instead of Result<OsString, Vec
>. 
In the docs, it says
- One could re-implement WTF-8 and re-encode file paths on Windows to WTF-8 by accessing their underlying 16-bit integer representation. Unfortunately, this isn’t zero cost (it introduces a second WTF-8 decoding step) and it’s not clear this is a good thing to do, since WTF-8 should ideally remain an internal implementation detail. ...
While this library may provide facilities for (1) in the future, currently, this library only provides facilities for (2) and (3).
(1) could be simplified, depending on what is done with OsStr, like https://github.com/rust-lang/rust/pull/95290
Should the existing functions have more specific names to open the door for the more general functions to be added later, if possible?
Non-blocker for 1.0: would it also be worth pointing people to os_str_bytes to help people who want (1)?
EDIT: Now split out as #115, #116
As a usability note, I wrote an application with BStr in the internal APis and kept getting tripped up that I needed as_bstr() or .map(ByteSlice::as_bstr) at the end of any transformation I did on a BStr to get it back into its own type.  In the ideal world, the output type would match the input type.
I'm assuming this is fully intentional and won't be changing but figured it was still worth noting.
EDIT: Split out into #117
@BurntSushi 1.60.0 MSRV works for me. Applications I build with bstr target the latest stable and library crates I've built with bstr have a "bump MSRV in minor versions" policy which I'd be happy to exercise to accommodate a higher bstr MSRV.
On a related note, I've been using the weak features and namespaced features in recent cargo and have found them to be quite pleasant to work with. Using serde1 = ["dep:serde", ...] means that a --features serde flag is no longer valid like it is now.
As a usability note, I wrote an application with
BStrin the internal APis and kept getting tripped up that I neededas_bstr()or.map(ByteSlice::as_bstr)at the end of any transformation I did on aBStrto get it back into its own type. In the ideal world, the output type would match the input type.I'm assuming this is fully intentional and won't be changing but figured it was still worth noting.
Right yeah. That's what bstr 0.1 was: https://docs.rs/bstr/0.1.4/bstr/
How do you feel about Rust 1.60 as the MSRV?
I think it's probably the right choice in this situation, given that you are trying to stabilize.
In planus, our base policy is to only bump the MSRV to versions that are at least six months old, which would mean that we could not start using bstr until October. That being said, it's a policy we try to follow, but break when there is sufficient reason to do so.
(While writing this, I am also realizing that our MSRV simultaneously says 4 versions of rust and 6 months, which are two very different things)
@lopopolo Yeah the new Cargo feature stuff is awesome. I very much like that it gives more control over what's a public feature.
@epage See https://github.com/BurntSushi/bstr/issues/5 and https://github.com/BurntSushi/bstr/pull/8 for more details on why bstr 0.1 -> 0.2 went from concrete BString/BStr types to extension traits on Vec<u8>/[u8]. It is slightly sad that if you have a BStr and call an extension trait method that returns a subslice of it that you get a &[u8] back instead. I'm not sure there is any simple machinery that would let me fix that while maintaining the existing extension trait API on [u8].
Happy to answer more questions on that but would prefer opening a new issue for it if you have more questions.
How do you feel about Rust 1.60 as the MSRV?
I'd be OK with it, thanks for asking.
Hi @BurntSushi I know this is late in the game for 1.0, but I wanted to float an idea of an optional dependency on simdutf8 for a SIMD-accelerated fast path to crate::utf8::validate:
https://github.com/BurntSushi/bstr/blob/0d9d22237f859d359ec676cff255b848c772e5a3/src/utf8.rs#L463-L471
I'm not sure how you'd want to structure this, but maybe an optional performance feature a la regex (that maybe does nothing for 1.0 but leaves the possibility for additional deps open in subsequent releases?).
Edit: coming back to this, I'm actually not sure whether I should expect the DFA in bstr or SIMD routines in simdutf8 to be faster. I just associate SIMD with 🏎️
Yeah that's not a 1.0 concern. I'm unlikely to take a dependency for that. Instead, I would rather see the implementation ported to bstr or maybe even memchr. That said, I've never looked at it in depth, so I'm unsure of its complexity.
Feel free to open a new issue about this to avoid cluttering up the 1.0 release thread.
For this thread, I would really like to keep it scoped to 1.0 concerns.