ByteSlice::trim on ASCII whitespace is substantially slower than core::str::trim
ByteSlice::trim (and related methods) are not competitive with libstd's str::trim when the whitespace being trimmed is ASCII. The difference is as much as 50%; I noticed it as a dip in benchmarks when moving some code over to bstr. I'm not an expert, but my understanding is that ASCII whitespace is much more common than non-ASCII whitespace in pretty much all scripts, so it's probably a good case to optimize for.
Here are two benchmarks that demonstrate the issue: https://gist.github.com/thomcc/d017dec2bf7fbfd017e4f34cfd4db6f8 (it's a gist because it's a bit too long to work well as a code block). The gist also contains a diff (its second file) that you can apply to insert the benchmarks directly into bstr's existing benchmark code.
The first (trim/source-lines) measures the time to trim every line of a source file (specifically ext_slice.rs, chosen arbitrarily), and is close to the real use case where I saw the issue with bstr.
The second (trim/large-ascii-padded) is completely artificial: it trims a huge string that starts and ends with long runs of ASCII whitespace, with a single non-whitespace character in the middle to ensure both trim_start and trim_end are exercised. It's focused on the specific issue, so it's probably the better benchmark, but it doesn't reflect a real use case.
The results below show that on the existing benchmark (trim/tiny), std and bstr perform roughly the same, but that std is substantially faster on both of the new benchmarks:
bstr/trim/tiny               time:   [50.634 ns 50.925 ns 51.261 ns]
                             thrpt:  [502.31 MiB/s 505.63 MiB/s 508.54 MiB/s]
std/trim/tiny                time:   [50.592 ns 50.743 ns 50.917 ns]
                             thrpt:  [505.71 MiB/s 507.45 MiB/s 508.96 MiB/s]
bstr/trim/source-lines       time:   [90.672 us 90.931 us 91.222 us]
                             thrpt:  [1.1964 GiB/s 1.2003 GiB/s 1.2037 GiB/s]
std/trim/source-lines        time:   [55.251 us 55.669 us 56.236 us]
                             thrpt:  [1.9408 GiB/s 1.9605 GiB/s 1.9754 GiB/s]
bstr/trim/large-ascii-padded time:   [9.4068 us 9.4174 us 9.4304 us]
                             thrpt:  [414.32 MiB/s 414.89 MiB/s 415.36 MiB/s]
std/trim/large-ascii-padded  time:   [4.1390 us 4.1472 us 4.1559 us]
                             thrpt:  [940.15 MiB/s 942.12 MiB/s 943.99 MiB/s]
I took a quick stab at improving it by replacing https://github.com/BurntSushi/bstr/blob/master/src/unicode/whitespace.rs#L7 with this:
pub fn whitespace_len_fwd(slice: &[u8]) -> usize {
    // Fast path: skip leading ASCII whitespace byte by byte.
    let mut i = 0;
    while i < slice.len() && slice[i].is_ascii_whitespace() {
        i += 1;
    }
    // If we stopped at the end, or at a non-whitespace ASCII byte, we're
    // done. Otherwise we hit a non-ASCII byte that might be Unicode
    // whitespace, so fall back to the Unicode-aware search starting at `i`.
    if i == slice.len() || slice[i].is_ascii() {
        i
    } else {
        WHITESPACE_ANCHORED_FWD.find_at(slice, i).unwrap_or(i)
    }
}
This improved the new benchmarks above by about 30%, although it hurt the existing benchmark by around 10%.
I couldn't quite avoid off-by-one errors in the _rev version (and I'm not 100% certain I've avoided them in the _fwd version either, tbh; there are probably bugs in the transition between ASCII and Unicode there). I'm not really sure this is an ideal approach anyway, so I figured I'd just report the issue rather than spend more time debugging it.
@thomcc Thanks for diving into this! I don't quite have the bandwidth to dive into this right now. I will at least do it whenever I get back around to releasing 1.0 in order to minimize context switching. If you do want to submit a PR with your current work, then I think that sounds good to me given the 10% loss but 30% gain. But more broadly, I certainly agree that optimizing for the ASCII case makes sense. I'd be happy to do that even if it makes the fully general Unicode case a good deal worse.
Just to reiterate my previous comment: I'm generally okay with accepting a small loss for non-ASCII in favor of making the ASCII case much faster.
I'm not going to get to this for 1.0 though, which is okay, since this is an internal detail.