bstr
bstr copied to clipboard
ascii.rs optimization since ptr is aligned
On line 186, there is a call to_mm_loadu_si128:
https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L186
That call does not require the pointer to be aligned at all. But it could be replaced with _mm_load_si128, which requires alignment to a 16-byte boundary.
On line 139, the ptr is aligned to the VECTOR_ALIGN mask, which aligns it to size_of::<__m128i>(), or to 16 bytes.
https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L120-L121
https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L139C9-L139C12
The ptr is always advanced by VECTOR_LOOP_SIZE in a loop, which is a multiple of 16 bytes:
https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L120-L122
https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L180C36-L180C52
And then further advanced by VECTOR_SIZE which is 16 bytes:
https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L120
https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L191
So in the loop at L186, the pointer is always aligned to 16 bytes: https://github.com/BurntSushi/bstr/blob/3dc5939f30daa1a8a6e5cc346bb77841f19ea415/src/ascii.rs#L186
The aligned version of the function was pointed out by @anforowicz during unsafe code audit: https://chromium-review.googlesource.com/c/chromium/src/+/5925797/comment/f08dc00c_1b24061c/
Hmmm I think actually there's probably a bug here. We probably should be making use of unaligned loads here to avoid the slow byte-at-a-time loop. Like this:
https://github.com/BurntSushi/memchr/blob/a26dd5b590e0a0ddcce454fb1ac02f5586e50952/src/arch/generic/memchr.rs#L219-L228
(Although that interestingly does use an unaligned load just before when it could be an aligned load.)
Aligned versus unaligned rarely makes a difference on x86-64 in my experience anyway, and especially here where this is just the "cleanup" aspect of the implementation.
But good catch. I agree that as the code is currently written, it can be safely an aligned load.