bstr icon indicating copy to clipboard operation
bstr copied to clipboard

ascii.rs optimization since ptr is aligned

Open danakj opened this issue 1 year ago • 1 comments

On line 186, there is a call to_mm_loadu_si128: https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L186

That call does not require the pointer to be aligned at all. But it could be replaced with _mm_load_si128, which requires alignment to a 16-byte boundary.

On line 139, the ptr is aligned to the VECTOR_ALIGN mask, which aligns it to size_of::<__m128i>(), or to 16 bytes. https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L120-L121 https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L139C9-L139C12

The ptr is always advanced by VECTOR_LOOP_SIZE in a loop, which is a multiple of 16 bytes: https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L120-L122 https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L180C36-L180C52

And then further advanced by VECTOR_SIZE which is 16 bytes: https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L120 https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L191

So in the loop at L186, the pointer is always aligned to 16 bytes: https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L186

The aligned version of the function was pointed out by @anforowicz during unsafe code audit: https://chromium-review.googlesource.com/c/chromium/src/+/5925797/comment/f08dc00c_1b24061c/

danakj avatar Oct 23 '24 20:10 danakj

Hmmm I think actually there's probably a bug here. We probably should be making use of unaligned loads here to avoid the slow byte-at-a-time loop. Like this:

https://github.com/BurntSushi/memchr/blob/a26dd5b590e0a0ddcce454fb1ac02f5586e50952/src/arch/generic/memchr.rs#L219-L228

(Although that interestingly does use an unaligned load just before when it could be an aligned load.)

Aligned versus unaligned rarely makes a difference on x86-64 in my experience anyway, and especially here where this is just the "cleanup" aspect of the implementation.

But good catch. I agree that as the code is currently written, it can be safely an aligned load.

BurntSushi avatar Oct 23 '24 23:10 BurntSushi