
Allow progressively building bitmask

Open jkeiser opened this issue 1 year ago • 8 comments

This PR adds a simd_bitmask64_builder class, which lets you progressively build a 64-bit bitmask from simd8<bool> (the results of == and friends on a simd register). The PR uses it in all simd8x64<bool>.to_bitmask() implementations. It is designed to have performance identical to today's to_bitmask(), but to allow it to be called after each SIMD register is processed, discarding whatever intermediate state the platform allows along the way.

For Intel platforms, this just means shifting the bits and or'ing them into a register. For other platforms such as ARM, it's something like a reduce operation, progressively smushing registers together with 16-to-8-bit adds until they end up in a single SIMD-sized register.
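As a rough illustration of the Intel side, here is a hedged sketch with my own simplified interface (not the PR's actual simd_bitmask64_builder): each 256-bit boolean register contributes a 32-bit movemask that gets shifted into place and OR'd into an accumulator.

```cpp
#include <cstdint>
#include <immintrin.h>

// Hedged sketch (a stand-in, not the PR's simd_bitmask64_builder): on AVX2,
// "progressively building" just means taking the 32-bit movemask of each
// boolean register, shifting it into position, and OR'ing it into a uint64_t.
struct bitmask64_builder {
  uint64_t bits = 0;
  int shift = 0;
  // bool_bytes holds 0x00/0xFF per byte (the result of == and friends).
  void append(__m256i bool_bytes) {
    bits |= uint64_t(uint32_t(_mm256_movemask_epi8(bool_bytes))) << shift;
    shift += 32;
  }
  uint64_t finish() const { return bits; }
};
```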

The purpose of this is to move towards a world where we can process one SIMD-sized chunk of bytes at a time, before upscaling to 64-bit masks and eventually to SIMD-sized masks. We're going to run out of registers if we keep just stashing everything in a simd8x64 or simd8x512 and processing it all at once at the end. This lets us be explicit and try to process everything we reasonably can as we process each SIMD register. The next step on that journey will be similar utilities to build a SIMD-sized bitmask progressively, over 8 steps--the savings from doing reduction become much more apparent there, requiring at most 4 registers per bitmask if you do as few operations as possible with as much parallelism as possible (as we do today in to_bitmask()).
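A hedged sketch of what a caller might look like under that model (hypothetical names, building on the sketch above): each comparison result is folded into the mask as soon as it is produced, rather than being stashed in a simd8x64 and converted at the end.

```cpp
#include <cstdint>
#include <immintrin.h>

// Hypothetical caller (my names, not simdjson code): build the comma mask for
// 64 bytes by appending each 32-byte comparison result as it is produced.
uint64_t comma_mask_64(const uint8_t *buf) {
  bitmask64_builder builder; // from the sketch above
  for (int i = 0; i < 2; i++) {
    __m256i chunk =
        _mm256_loadu_si256(reinterpret_cast<const __m256i *>(buf + 32 * i));
    builder.append(_mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(',')));
  }
  return builder.finish();
}
```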

I have another draft built on this which allows the UTF-8 checking algorithm to run one SIMD register at a time as well.

DRAFT: I have not yet tested many of the platforms--just Haswell, which was running at the same performance as today on my machine.

jkeiser avatar Aug 17 '24 04:08 jkeiser

The failures appear to be due to my deleting one of the two apparently redundant copies of json_path_to_pointer_conversion since they were causing compile errors in clangd ... I'm not sure which one was intended, though perhaps the fact that the one I did preserve didn't compile at first should have been a clue :)

jkeiser avatar Aug 17 '24 04:08 jkeiser

Woohoo!

lemire avatar Aug 17 '24 05:08 lemire

I haven't looked at simdjson internals as much as I should, but I strongly believe the goal should be to use interleaved vectors from LD4 for everything on ARM. (two LD4's on 32-bit ARM, if you support that)

In https://github.com/simdutf/simdutf/issues/428, I show that 3 ext's per 16-byte vector can be reduced to 3 ext's per 64-byte chunk. That's 16 total vectors reduced to 7. I thought of this technique for the Accelerated-Zig-Parser, which does a single LD4 per 64-byte chunk and uses that for everything (not all my updates are public yet, though).

I also believe that 4 sri's and a shrn make the best movemask routine, compared to vpadds. It's only 5 instructions, requires no constants, I believe it matches the speed of vpadd techniques (although I'd like to hear better ideas), and it works on 32-bit ARM (no vpadds allowed). It also seems to me that for high-performance ARM CPUs you want to run at least 4 16-byte vector ops at once (or pipelined, starting a new one each cycle).

If your goal is just to change the order in which instructions are emitted in the hope of less spilling to the stack (have you looked at the assembly?), then maybe it doesn't hurt, but in general I think moving to checking one 16-byte chunk at a time is a step backwards.
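For reference, the vpadd-style movemask being compared against looks roughly like this on AArch64 (a hedged sketch with my own naming, not simdjson's verbatim code): each input is a 0x00/0xFF comparison result, and three rounds of pairwise adds fold 64 bytes into a 64-bit mask.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Sketch of a vpadd-based 64-byte movemask: keep one distinct bit per byte,
// then OR neighbouring bytes together via pairwise adds until 8 bytes remain.
static inline uint64_t vpadd_movemask_64(uint8x16_t m0, uint8x16_t m1,
                                         uint8x16_t m2, uint8x16_t m3) {
  static const uint8_t bits[16] = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
                                   0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80};
  const uint8x16_t bit_mask = vld1q_u8(bits);
  uint8x16_t sum0 = vpaddq_u8(vandq_u8(m0, bit_mask), vandq_u8(m1, bit_mask));
  uint8x16_t sum1 = vpaddq_u8(vandq_u8(m2, bit_mask), vandq_u8(m3, bit_mask));
  sum0 = vpaddq_u8(sum0, sum1);
  sum0 = vpaddq_u8(sum0, sum0);
  return vgetq_lane_u64(vreinterpretq_u64_u8(sum0), 0);
}
```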

I'm just addressing your stated eventual goal in your paragraph by the way, not your PR per se.

I'd be really interested to hear if you disagree, since I'd like to get better performance on ARM as well!

Validark avatar Aug 17 '24 23:08 Validark

That definitely sounds interesting! I haven't quite grokked it yet, but UTF-8 may well work better in 64-byte chunks on ARM. The main reason I brought ARM up is that its algorithm is already designed to generate 128-bit bitmasks, but right now we stop at 64 bits with one redundant op and then spend another redundant op to finish up and extract the bitmasks. I don't actually have an ARM machine handy to test on, but that's part of what has me interested.

Part of the reason to put this up right now is to do the experiment and see what effect it actually has :) The thing that's really caught my fancy is the idea of doing a decent portion of simdjson's parsing in 128- to 512-bit masks representing booleans for 128-512 bytes. That obviously doesn't apply to UTF-8, where the algorithm is heavily byte-focused, so there I'm mainly trying to validate whether incrementally building bitmasks can be done without losing performance vs. the current algorithm, so that we can tune register pressure and parallelism up and down at will.

jkeiser avatar Aug 18 '24 03:08 jkeiser

I haven't quite grokked it yet

I think you or other people might be overthinking that issue? All I'm describing is a reordering.

Normal byte order is:

0 1 2 3 4 5 6 7 8 9 A B C D E F

We want to compare (or do some operation, doesn't matter) each byte to the byte that came before. What bytes came before those (at each position)? Well, these:

-1 0 1 2 3 4 5 6 7 8 9 A B C D E

(Column-wise) You just subtract one from each byte's position; that's the position you need to compare against. LD4 loads data like so (this is the first of the four vectors it produces):

0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60

Therefore, to reproduce the normal pairing (0 => -1, 1 => 0, 2 => 1, 3 => 2, 4 => 3), we subtract 1 from each of these to get the position of the bytes to compare against.

-1 3 7 11 15 19 23 27 31 35 39 43 47 51 55 59

That happens to be almost the same as one of the vectors LD4 loads in:

3 7 11 15 19 23 27 31 35 39 43 47 51 55 59 63
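To make the pairing concrete, here is a hedged NEON sketch (identifiers are mine, not simdjson's) of how the interleaved lanes line up after vld4q_u8: the "previous byte" of lanes 1-3 is simply lanes 0-2, and lane 0's previous bytes are lane 3 shifted by one element, with the -1 byte coming from the prior chunk (that carry is discussed in the follow-up comments below).

```cpp
#include <arm_neon.h>

// in.val[0] = bytes 0,4,...,60   in.val[1] = bytes 1,5,...,61
// in.val[2] = bytes 2,6,...,62   in.val[3] = bytes 3,7,...,63
// prev_lane3 = lane 3 of the previous 64-byte chunk (supplies byte -1).
static inline void prev1_lanes(uint8x16x4_t in, uint8x16_t prev_lane3,
                               uint8x16_t out[4]) {
  out[0] = vextq_u8(prev_lane3, in.val[3], 15); // bytes -1,3,7,...,59
  out[1] = in.val[0];                           // bytes  0,4,8,...,60
  out[2] = in.val[1];                           // bytes  1,5,9,...,61
  out[3] = in.val[2];                           // bytes  2,6,10,...,62
}
```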

Hope this explanation helps

Validark avatar Aug 18 '24 05:08 Validark

@Validark it's a really interesting idea and has some promise! I haven't had the time to really sit down with it, which is the main reason I don't fully grok it yet. I get how the prev is automatically loaded up for a lot of the masks, but as you noted, it's only almost loaded (the 2 bottom bitmasks are missing some stuff that exists only in the previous chunk). I'm sure you already have a solution; my brain is just a little full at this exact moment :)

jkeiser avatar Aug 18 '24 18:08 jkeiser

At the end of each iteration you hold on to the latter 3 vectors from LD4, and on your next LD4, you shift in the upper element from each of those into the bottom byte of the new latter 3 vectors. This is 3 ext instructions, all parallel, and you can overwrite the stored vectors, so you still only have 7 vectors in play at once, with which you can line up any combination of prev0 (current vector), prev1, prev2, or prev3 for whatever you want!

I guess I didn't spell this part out before, but when you move on to the next 64-byte chunk (0, 1, 2, 3, ...), the old byte 63 becomes your -1 byte, the old 62 is your -2, and the old 61 is your -3.
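A hedged sketch of that carry-over step (identifiers are mine, not from simdjson): hold the latter three lanes of the previous chunk and splice each one's last element into the front of the corresponding new lane with three parallel ext instructions.

```cpp
#include <arm_neon.h>

// Lanes 1..3 of the previous 64-byte chunk carry bytes 61, 62, 63 forward.
struct prev_lanes_state {
  uint8x16_t lane1, lane2, lane3;
};

static inline uint8x16x4_t load_chunk_with_carry(const uint8_t *src,
                                                 prev_lanes_state *state,
                                                 uint8x16_t carried[3]) {
  uint8x16x4_t cur = vld4q_u8(src);
  carried[0] = vextq_u8(state->lane1, cur.val[1], 15); // old 61 (-3), then 1,5,...,57
  carried[1] = vextq_u8(state->lane2, cur.val[2], 15); // old 62 (-2), then 2,6,...,58
  carried[2] = vextq_u8(state->lane3, cur.val[3], 15); // old 63 (-1), then 3,7,...,59
  state->lane1 = cur.val[1];
  state->lane2 = cur.val[2];
  state->lane3 = cur.val[3];
  // cur.val[0..3] plus carried[0..2] = 7 vectors in play, enough to line up
  // prev1, prev2, or prev3 against any lane.
  return cur;
}
```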

Validark avatar Aug 18 '24 21:08 Validark

Hey @jkeiser, I just moved my blog over to a new framework and wrote an article on the subject. Maybe this helps?

https://validark.github.io/posts/interleaved-vectors-on-arm/

Validark avatar Sep 03 '24 22:09 Validark