ByteSlice::trim on ASCII whitespace is substantially slower than core::str::trim
ByteSlice::trim (and related methods) are not competitive with libstd's str::trim when the whitespace being trimmed is ASCII. The difference is as much as 50%; I noticed it as a dip in benchmarks when moving some code over to bstr. I'm not an expert, but my understanding is that ASCII whitespace is much more common than non-ASCII whitespace in pretty much all scripts, so it's probably a good case to optimize for.
Here are two benchmarks that demonstrate the issue: https://gist.github.com/thomcc/d017dec2bf7fbfd017e4f34cfd4db6f8 (it's a gist because it's a bit too long to work well as a code block). The gist also contains a diff (its second file) that you can apply to insert the benchmarks directly into bstr's existing benchmark code.
The first (trim/source-lines) measures the time to trim every line of a source file (specifically ext_slice.rs, chosen arbitrarily), and is close to the real use case where I saw the issue with bstr.
The second (trim/large-ascii-padded) is completely artificial: it trims a huge string that starts and ends with long runs of ASCII whitespace, with a single non-whitespace character in the middle to ensure both trim_start and trim_end are exercised. It's focused on the specific issue, so it's probably the better benchmark, but it doesn't reflect a real use case.
The results below show that on the existing benchmark (trim/tiny), std and bstr perform roughly the same, but that std is substantially faster on both of the new benchmarks:
bstr/trim/tiny               time:   [50.634 ns 50.925 ns 51.261 ns]
                             thrpt:  [502.31 MiB/s 505.63 MiB/s 508.54 MiB/s]
std/trim/tiny                time:   [50.592 ns 50.743 ns 50.917 ns]
                             thrpt:  [505.71 MiB/s 507.45 MiB/s 508.96 MiB/s]
bstr/trim/source-lines       time:   [90.672 us 90.931 us 91.222 us]
                             thrpt:  [1.1964 GiB/s 1.2003 GiB/s 1.2037 GiB/s]
std/trim/source-lines        time:   [55.251 us 55.669 us 56.236 us]
                             thrpt:  [1.9408 GiB/s 1.9605 GiB/s 1.9754 GiB/s]
bstr/trim/large-ascii-padded time:   [9.4068 us 9.4174 us 9.4304 us]
                             thrpt:  [414.32 MiB/s 414.89 MiB/s 415.36 MiB/s]
std/trim/large-ascii-padded  time:   [4.1390 us 4.1472 us 4.1559 us]
                             thrpt:  [940.15 MiB/s 942.12 MiB/s 943.99 MiB/s]
I took a quick stab at improving it by replacing https://github.com/BurntSushi/bstr/blob/master/src/unicode/whitespace.rs#L7 with this:
pub fn whitespace_len_fwd(slice: &[u8]) -> usize {
    // Fast path: skip leading ASCII whitespace byte by byte.
    let mut i = 0;
    while i < slice.len() && slice[i].is_ascii_whitespace() {
        i += 1;
    }
    // If we stopped at the end, or at a non-whitespace ASCII byte, we're
    // done. Otherwise we hit a non-ASCII byte that might be Unicode
    // whitespace, so fall back to the Unicode-aware search starting at `i`.
    if i == slice.len() || slice[i].is_ascii() {
        i
    } else {
        WHITESPACE_ANCHORED_FWD.find_at(slice, i).unwrap_or(i)
    }
}
This improved the new benchmarks above by about 30%, although it hurt the existing benchmark by around 10%.
I couldn't quite avoid off-by-one errors in the _rev version (and I'm not 100% certain I've avoided them in the _fwd version either, tbh; there are probably bugs in the transition between ASCII and Unicode there). I'm not really sure this is an ideal approach anyway, so I figured I'd just report the issue rather than spend more time debugging it.
@thomcc Thanks for diving into this! I don't quite have the bandwidth to dive into this right now. I will at least do it whenever I get back around to releasing 1.0 in order to minimize context switching. If you do want to submit a PR with your current work, then I think that sounds good to me given the 10% loss but 30% gain. But more broadly, I certainly agree that optimizing for the ASCII case makes sense. I'd be happy to do that even if it makes the fully general Unicode case a good deal worse.
Just to reiterate my previous comment: I'm generally okay with accepting a small loss for non-ASCII in favor of making the ASCII case much faster.
I'm not going to get to this for 1.0 though, which is okay, since this is an internal detail.