SIMD vectorization behind a feature flag

Open dtolnay opened this issue 9 years ago • 17 comments

RapidJSON does quite a lot of this.

dtolnay avatar Jul 18 '16 01:07 dtolnay

It would be interesting to see hermetic tests done for that vs iteration with a LUT. You need to:

  • check how many vectors you can use (len >> 4, not a big deal)
  • vectorize the bytes - memcpy should be fast
  • do 4? instructions (check for backslash, quote, control characters with the exception of whitespace)
  • flatten the boolean vector to a single value.

And then you still need to do something about the remaining len % 16 bytes.
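For readers comparing the two approaches: here is a minimal sketch of the byte-at-a-time LUT baseline referred to above (the table construction and function names are illustrative, not serde_json's or json-rust's actual code):

/// Build a 256-entry lookup table: `true` for bytes that need special
/// handling inside a JSON string (quote, backslash, control characters).
fn build_table() -> [bool; 256] {
    let mut table = [false; 256];
    for b in 0u8..=255 {
        table[b as usize] = b < 0x20 || b == b'"' || b == b'\\';
    }
    table
}

/// Byte-at-a-time scan: advance `index` past the longest run of plain bytes.
fn skip_plain(table: &[bool; 256], bytes: &[u8], mut index: usize) -> usize {
    while index < bytes.len() && !table[bytes[index] as usize] {
        index += 1;
    }
    index
}

The SIMD variant replaces the inner loop with one comparison per 16-byte chunk, at the cost of the setup and tail handling described above.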

maciejhirsz avatar Jul 18 '16 09:07 maciejhirsz

I made a very rough test in json-rust. Using this as test data:

{"foo":"This is a super long string, omg so long, why is it so long? It is so long because it needs to contain many multiplications of 16 bytes in order for us to get any benefits from SIMD at all! That's why! It's kind of annoying like that, but hey, what can you do..."}

Without SIMD:

test json_rust_parse ... bench:         414 ns/iter (+/- 143) = 657 MB/s

With SIMD:

test json_rust_parse ... bench:         267 ns/iter (+/- 28) = 1018 MB/s

This is pretty much the best case scenario for this kind of thing, but it works. The implementation is pretty simple: before I start running the loop that checks bytes one by one, I do this:

        // Process as many whole 16-byte chunks as remain in the input.
        for _ in 0 .. ($parser.length - $parser.index) >> 4 {
            let bytes = simd::u8x16::load($parser.source.as_bytes(), $parser.index);

            // Bail out to the byte-by-byte loop if this chunk contains a
            // control character, backslash, or quote.
            if (bytes.lt(CT_SIMD) | bytes.eq(BS_SIMD) | bytes.eq(QU_SIMD)).any() {
                break;
            }

            $parser.index += 16;
        }

Where CT_SIMD is a splat of 0x20, BS_SIMD is a splat of 0x5C and QU_SIMD is a splat of 0x22 (as you'd expect).
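The snippet above uses the pre-stabilization simd crate. For anyone following along on current Rust, a rough equivalent of the same check with stable std::arch SSE2 intrinsics might look like this (a sketch under that assumption, not the json-rust code):

#[cfg(target_arch = "x86_64")]
fn skip_plain_sse2(bytes: &[u8], mut index: usize) -> usize {
    use std::arch::x86_64::*;

    while index + 16 <= bytes.len() {
        // SSE2 is part of the x86_64 baseline, so these intrinsics are always available there.
        unsafe {
            // Unaligned 16-byte load of the next chunk.
            let chunk = _mm_loadu_si128(bytes.as_ptr().add(index) as *const __m128i);

            let quote = _mm_cmpeq_epi8(chunk, _mm_set1_epi8(0x22));     // '"'
            let backslash = _mm_cmpeq_epi8(chunk, _mm_set1_epi8(0x5C)); // backslash
            // byte < 0x20  <=>  min(byte, 0x1F) == byte (unsigned comparison trick)
            let ctrl = _mm_cmpeq_epi8(_mm_min_epu8(chunk, _mm_set1_epi8(0x1F)), chunk);

            let special = _mm_or_si128(_mm_or_si128(ctrl, quote), backslash);
            if _mm_movemask_epi8(special) != 0 {
                break; // something in this chunk needs byte-by-byte handling
            }
        }
        index += 16;
    }
    index // the caller handles the remaining len % 16 bytes with the scalar loop
}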

Gonna run this against json-benchmark, will come back with numbers.

maciejhirsz avatar Jul 19 '16 13:07 maciejhirsz

On json-benchmark the numbers are pretty much the same on all tests. The version with SIMD is on the dev branch if you want to try it on your hardware. I only did it for parsing, not serializing.

maciejhirsz avatar Jul 19 '16 14:07 maciejhirsz

Same here, even a little slower. It may be that unaligned u8x16 loads are slow - you don't have the part that establishes 16-byte alignment first, the way RapidJSON does.

I would be interested in benchmarking two variants of your code - one that aligns to 16 bytes before the SIMD loop like in RapidJSON, and one that goes 16 bytes past the first 16-byte alignment before starting SIMD. It is all about balancing no-SIMD for short strings (which are most strings) and SIMD for long strings (which are slow otherwise).

dtolnay avatar Jul 19 '16 15:07 dtolnay

I think the simd lib does the alignment via the repr annotation:

/// A SIMD vector of 16 `u8`s.
#[repr(simd)]
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
#[derive(Debug, Copy)]
pub struct u8x16(u8, u8, u8, u8, u8, u8, u8, u8,
                 u8, u8, u8, u8, u8, u8, u8, u8);

I've only been catching up with low-level stuff this year, so things like byte alignment are still semi-exotic to me. I only looked this up, and the author of the answer is the author of the simd implementation.

maciejhirsz avatar Jul 19 '16 16:07 maciejhirsz

I am talking about alignment of $parser.source[$parser.index]. If that is not 16-byte aligned, there is nothing the SIMD library can do about it.

dtolnay avatar Jul 19 '16 17:07 dtolnay

Gotcha! I assume casting the pointer to usize, taking it modulo 16, and then running over that many bytes with the regular check should be sufficient.
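A minimal sketch of that prefix computation (a hypothetical helper, not from the json-rust dev branch; note the scalar prefix is the distance to the next 16-byte boundary, i.e. the complement of the remainder):

/// How many leading bytes to handle with the regular byte-by-byte check so
/// that the SIMD loop starts on a 16-byte boundary.
fn bytes_until_aligned(ptr: *const u8) -> usize {
    let addr = ptr as usize;
    // Distance to the next 16-byte boundary; 0 if already aligned.
    // Equivalently: ptr.align_offset(16).
    (16 - (addr % 16)) % 16
}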

maciejhirsz avatar Jul 19 '16 17:07 maciejhirsz

Yes.

And then for the other test, that many bytes + 16 more.

dtolnay avatar Jul 19 '16 17:07 dtolnay

I don't seem to get any statistically significant difference. Increased the string length to 3155 characters.

No SIMD:

test json_rust_parse ... bench:       2,592 ns/iter (+/- 1,071) = 1221 MB/s

SIMD, no alignment:

test json_rust_parse ... bench:         683 ns/iter (+/- 481) = 4633 MB/s

SIMD, with alignment:

test json_rust_parse ... bench:         754 ns/iter (+/- 502) = 4197 MB/s

SIMD, with alignment offset by 1:

test json_rust_parse ... bench:         755 ns/iter (+/- 358) = 4192 MB/s

Pushed to dev along with quickly hacked benchmarks if you want to play around.

maciejhirsz avatar Jul 19 '16 17:07 maciejhirsz

Ok, got it - it seems u8x16::load doesn't actually use memcpy. Reworked it into this mess:

// Reserve an uninitialized 16-byte vector, then memcpy the next 16 input
// bytes into it; copy_nonoverlapping is fine with an unaligned source.
let mut bytes: simd::u8x16 = mem::uninitialized();
let bytes_ptr: *mut u8 = mem::transmute(&mut bytes);
ptr::copy_nonoverlapping(
    $parser.byte_ptr.offset($parser.index as isize),
    bytes_ptr,
    16
);

Now I get the expected results. With alignment:

test json_rust_parse ... bench:         625 ns/iter (+/- 321) = 5064 MB/s

With alignment offset 1:

test json_rust_parse ... bench:         651 ns/iter (+/- 367) = 4861 MB/s

The difference isn't staggering, but it's there.
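For later readers: the uninitialized/transmute/copy dance above predates MaybeUninit; on today's Rust a single unaligned read does the same copy (a sketch, the helper name is hypothetical):

/// Copy the next 16 input bytes into a stack array with one unaligned read.
/// On x86, _mm_loadu_si128 from std::arch is the direct SIMD-register
/// equivalent of this.
unsafe fn load16(src: *const u8, index: usize) -> [u8; 16] {
    std::ptr::read_unaligned(src.add(index) as *const [u8; 16])
}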

maciejhirsz avatar Jul 19 '16 18:07 maciejhirsz

Wow, that's interesting! I should also say that all modern CPUs (at least Intel's) have AVX2, which allows 256-bit vectors. Would be nice to try. This would have to be #[cfg]'d, of course.
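A sketch of how the 256-bit variant could be gated with runtime detection plus #[target_feature] (function names and the scalar fallback are illustrative assumptions, not code from any of the crates discussed here):

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn skip_plain_avx2(bytes: &[u8], mut index: usize) -> usize {
    use std::arch::x86_64::*;

    while index + 32 <= bytes.len() {
        let chunk = _mm256_loadu_si256(bytes.as_ptr().add(index) as *const __m256i);

        let quote = _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0x22));
        let backslash = _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0x5C));
        // byte < 0x20  <=>  min(byte, 0x1F) == byte (unsigned comparison trick)
        let ctrl = _mm256_cmpeq_epi8(_mm256_min_epu8(chunk, _mm256_set1_epi8(0x1F)), chunk);

        let special = _mm256_or_si256(_mm256_or_si256(ctrl, quote), backslash);
        if _mm256_movemask_epi8(special) != 0 {
            break;
        }
        index += 32;
    }
    index
}

fn skip_plain(bytes: &[u8], mut index: usize) -> usize {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: we just verified at runtime that the CPU has AVX2.
            index = unsafe { skip_plain_avx2(bytes, index) };
        }
    }
    // Scalar loop: handles non-AVX2 CPUs and the trailing len % 32 bytes.
    while index < bytes.len() {
        let b = bytes[index];
        if b < 0x20 || b == b'"' || b == b'\\' {
            break;
        }
        index += 1;
    }
    index
}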

alexbool avatar Jul 27 '16 07:07 alexbool

SIMD is on track for stabilization 🎉

arthurprs avatar Feb 09 '18 12:02 arthurprs

Has anyone done anything with this?

ijl avatar Jan 23 '19 00:01 ijl

This recently became a thing: https://github.com/lemire/simdjson

mmstick avatar Feb 23 '19 14:02 mmstick

I ported simdjson to Rust; the performance boost SIMD gives is significant. Even skipping the tape tricks and adopting a more Rust-ergonomic data representation, it is up to 300% faster than vanilla serde-json.

Licenser avatar Jun 20 '19 20:06 Licenser

How has the discussion progressed so far? It's been 7 years already.

JimChenWYU avatar Nov 15 '23 09:11 JimChenWYU