flexible-vectors
[WIP] SVE-like flexible vectors
Hello everyone, this is an attempt to formalize flexible vectors in a model that mimics SVE, as an alternative to #21. The main goal of this PR is to discuss the model, leaving the bikeshedding aside (at least for now).
The main idea is the following: each target architecture has a runtime constant SIMD width (ie: it cannot change during execution of the process), but this width is not known at compile time because we don't know the target architecture at this point. All vectors have a width that is equal to this target SIMD width.
Smaller vectors are handled with masked operations. Each operation defines how it handles inactive elements. There are usually 3 policies: zero inactive elements, forward inactive elements from the first input, or leave them undefined. This third policy might be controversial and could be removed, but it gives more flexibility to WASM engines in cases where the user actually does not care.
There are 10 types: 5 vector types (vec.v8, vec.v16, vec.v32, vec.v64, vec.v128) and 5 mask types (vec.m8, vec.m16, vec.m32, vec.m64, vec.m128). This is needed because of the different ways target architectures handle masks. Some have type-width agnostic masks (eg: AVX512), while some have type-width aware masks (eg: SSE, SVE).
Technically, it might not be necessary to have multiple vector types and only keep multiple mask types, but I think it simplifies the mental model to have a direct correspondence between vectors and masks (you cannot use a vec.m8 mask with a vec.v16 vector).
128-bit elements might not be necessary, but they would provide users a simpler way to deal with flexible vectors as they could have all the SIMD operations on 128-bit blocks they are used to (including compile-time shuffles).
There are many operations missing, eg: gathers, scatters, compact, expand, narrowing arithmetic, widening arithmetic, some mask operations (equivalent to break control flow), first-fault loads. Currently, mask sizes are not defined and thus no way to load/store masks from/to memory exists. It is always possible to convert a mask into a vector and have this vector in memory, but this looks contrived. I see 2 orthogonal ways to solve this issue: let mask types have sizes and have a way to load/store them in their natural, architectural representation, or have a way to load/store them in a compact format (1 bit per element). The second would make the most sense as an abstraction layer, but could be slow on some architectures (Neon for sure, maybe SVE).
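To make the compact 1-bit-per-element format concrete, here is a minimal sketch (my own illustration, not proposal text) of packing an SSE byte mask into bits; the helper name is made up, and the Neon concern above stems from the lack of a single-instruction movemask equivalent there.

#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: pack a vec.m8-style byte mask (0x00/0xFF per lane)
 * into the compact 1-bit-per-element format. On SSE2 this is a single
 * movemask; Neon needs a longer multi-instruction sequence. */
static uint16_t m8_to_bits_sse(__m128i byte_mask) {
  return (uint16_t)_mm_movemask_epi8(byte_mask);
}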
More details in the README.
(edit) The points I wish to be discussed:
- The semantics of masks (eg: do we need more/less operations?)
- The semantics of narrowing/widening operations
- The semantics of swizzling
- Is this model enough for most use cases?
- Is this model efficiently implementable on "legacy" architectures (namely SSE, and Neon)?
Hello everyone, this is an attempt to formalize flexible vectors in a model that mimics SVE, as an alternative to #21. The main goal of this PR is to discuss the model, leaving the bikeshedding aside (at least for now).
My main issue is that the PR replaces the entire proposal :) Also, there are some instructions other than just masks that are added - do you want to discuss those?
The main idea is the following: each target architecture has a runtime constant SIMD width (ie: it cannot change during execution of the process), but this width is not known at compile time because we don't know the target architecture at this point. All vectors have a width that is equal to this target SIMD width.
This is already true. Sorry, it wasn't very clear before, I've updated the readme to make this a bit more explicit.
The masks change is still welcome - let's just fit it inside what is already set up if possible. If you feel like it can't be reconciled with the current state of the proposal I will be happy to hear why, maybe we can do something about it.
My main issue is that the PR replaces the entire proposal :)
Yeah, sorry about that. This PR is not intended to be a true PR: I don't expect it to be merged. The goal is to have a solid base on which we can discuss, and I believe that an issue would not be as convenient as a PR. As soon as we are confident enough this model can work and covers most use-cases, I will retro-fit this concept into your actual proposal.
Also, there are some instructions other than just masks that are added - do you want to discuss those?
Basically, I think we should discuss anything related to the flexible vector concept. Lane-wise operations are no issue, so I mostly skipped them, but shuffles and interleaving (for instance) need a proper definition.
So basically, the points I wish to be discussed:
- The semantics of masks (eg: do we need more/less operations?)
- The semantics of narrowing/widening operations
- The semantics of swizzling
- Is this model enough for most use cases?
- Is this model efficiently implementable on "legacy" architectures (namely SSE, and Neon)?
This is already true. Sorry, it wasn't very clear before, I've updated the readme to make this a bit more explicit.
Nice ;)
However, I don't know how it will play with your optional set_length.
That's why I made it very explicit that length is actually a runtime constant (I don't want a true set_length).
- The semantics of masks (eg: do we need more/less operations?)
I think this is tied with the "use cases" question - if we have use cases for more, we can add more, and we should not add operations we don't have use cases for.
I agree it is tied to use cases. Here are some operations that I am confident are useful:
- vec.mX.index_CMP: most used will be index_lt, to implement most loops
- all mask bitwise operations
- vec.mX.all|none: creating basic masks
- vec.vX.CMP
- vec.vX.test_none|any|all: used for branching in SIMD
There are others I'm not sure are useful, and some that I incorrectly ported from SVE:
- vec.mX.first|last: they are interesting to iterate over lanes while staying in SIMD. The SVE version takes 2 masks to implement this use case more efficiently. SVE also provides the equivalent to previous and next.
- vec.mX.index_first|last: transforms a mask into a lane index; might be useful to chain with lane-index operations (like insert or extract), but potentially hard to implement on SVE/RISC-V V.
- vec.mX.count: counts the number of active elements. I think it is mainly useful in conjunction with compact. The SVE version also takes 2 masks.
Also, for now, lane-wise operations have 4 mask policies:
- nothing: the operation is not masked (all lanes are active).
- _mz: The inactive lanes are set to 0.
- _mm: The inactive lanes are copied from the first input (non-commutative operations like sub and div might need a reverse version that swaps the meaning of the inputs).
- _mx: The inactive lanes have an undefined value. This would let the WASM engine choose the best policy to actually translate to.
I don't know if we actually need all 4 of them, especially the _mx one.
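For a rough idea of the lowering cost of these policies, here is a hedged sketch (my names, not the proposal's) of a 32-bit masked add: the _mz and _mm policies emulated with SSE4.1, and the same policies mapped directly to AVX-512 masked instructions (assuming an AVX-512-capable toolchain for the last two helpers).

#include <immintrin.h>

/* _mz policy on SSE4.1: compute the full add, then zero inactive lanes.
 * The mask is assumed to be all-ones or all-zeros per 32-bit lane. */
static __m128i add_i32_mz_sse(__m128i a, __m128i b, __m128i mask) {
  return _mm_and_si128(_mm_add_epi32(a, b), mask);
}

/* _mm policy on SSE4.1: blend the result with the first input under the mask,
 * so inactive lanes keep a's value. */
static __m128i add_i32_mm_sse(__m128i a, __m128i b, __m128i mask) {
  return _mm_blendv_epi8(a, _mm_add_epi32(a, b), mask);
}

/* On AVX-512 both policies are single masked instructions. */
static __m512i add_i32_mz_avx512(__m512i a, __m512i b, __mmask16 k) {
  return _mm512_maskz_add_epi32(k, a, b);
}
static __m512i add_i32_mm_avx512(__m512i a, __m512i b, __mmask16 k) {
  return _mm512_mask_add_epi32(a, k, a, b);
}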
There is a last question about masks. In a masked ISA, almost all operations take a mask.
As WASM is a stack VM, you would naively need to "copy" the mask on top of the stack over and over for each operation. How do we deal with this at the WASM level to avoid too many local.get?
- The semantics of narrowing/widening operations
What would be the concern here? With values encoding lane types we can go between lane sizes relatively easily.
The question here is which lanes are combined together when narrowing 2 vectors together, and which lanes are selected when widening half of a vector. Basically, does narrowing concat inputs, or interleave inputs? Does widening select the low/high half or the even/odd half? This might not be an issue as there seems to be consensus on this subject across architectures (ie: narrowing concats and widening selects low/high).
Also, do narrowing operations take multiple inputs to form a full vector, or do we just fill half of the output? I think the best way would be to take multiple vectors as input, just because it would be a bit harder to implement the former behavior from the latter.
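To make the convention concrete, here is a scalar reference-semantics sketch of "narrowing concats, widening selects low/high" for 32-bit to 16-bit lanes (my own illustration; N stands for the 32-bit lane count, so the 16-bit vector has 2*N lanes).

#include <stdint.h>
#include <stddef.h>

/* Narrowing: both inputs are truncated and concatenated,
 * a's lanes first, then b's lanes. */
static void narrow_i32_to_i16(int16_t *out /* 2*N lanes */,
                              const int32_t *a, const int32_t *b, size_t N) {
  for (size_t i = 0; i < N; ++i) out[i]     = (int16_t)a[i];
  for (size_t i = 0; i < N; ++i) out[N + i] = (int16_t)b[i];
}

/* Widening: the low (or high) half of the input is sign-extended. */
static void widen_low_i16_to_i32(int32_t *out /* N lanes */,
                                 const int16_t *in /* 2*N lanes */, size_t N) {
  for (size_t i = 0; i < N; ++i) out[i] = (int32_t)in[i];       /* low half */
}
static void widen_high_i16_to_i32(int32_t *out, const int16_t *in, size_t N) {
  for (size_t i = 0; i < N; ++i) out[i] = (int32_t)in[N + i];   /* high half */
}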
- The semantics of swizzling
That is the most "interesting" aspect of flexible length vector operations - it is very hard to do general purpose swizzle and shuffle if you don't know the length (especially shuffle with compile-time masks). However it is definitely possible to do things like concat, unpack, shift lane, etc.
Full-width shuffles with compile-time masks are out of the question. The best we could propose is full-width runtime shuffles, where the index vector comes from an eventual vec.vX.const.
The common swizzling operations are however already defined here:
- vec.vX.interleave_even|odd
- vec.vX.interleave_low|high
- vec.vX.concat_even|odd
- vec.vX.lane_shift (and the mask-only equivalent vec.vX.concat)
Full-width runtime shuffles are handled with vec.vX.LUT1. For now there are 2 out-of-bounds policies:
- out-of-bounds lanes are set to 0
- out-of-bounds lanes are copied from a fallback vector
The latter makes it simpler to implement larger LUTs (ie: from more inputs), but might not be super useful as it is already possible with the former policy. We could also envision a third policy where indices are interpreted modulo the vector length (ie: no out-of-bounds), but I'm not sure it would be super useful.
Also, SVE and Neon provide LUT2, LUT3 and LUT4 (respectively taking from 2, 3, and 4 input vectors).
The equivalent of the intel permutexvar would be LUT1, and permutex2var would be LUT2 (indices are wrapped).
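As a reference-semantics sketch (the helper names and per-lane loops are mine, not the proposal's), the fallback policy lets a LUT2 be emulated by chaining two LUT1 lookups:

#include <stdint.h>
#include <stddef.h>

/* One LUT1 step with the fallback policy: in-range indices read from the
 * table, out-of-range lanes take the value from the fallback vector. */
static void lut1_fallback(uint8_t *out, const uint8_t *table, const uint8_t *idx,
                          const uint8_t *fallback, size_t N) {
  for (size_t i = 0; i < N; ++i)
    out[i] = (idx[i] < N) ? table[idx[i]] : fallback[i];
}

/* LUT2 emulation: look up the second table with rebased indices (zero fallback),
 * then chain a lookup into the first table with that result as fallback.
 * The sketch assumes N <= 256 lanes. */
static void lut2_from_lut1(uint8_t *out, const uint8_t *t0, const uint8_t *t1,
                           const uint8_t *idx, size_t N) {
  uint8_t rebased[256], tmp[256];
  const uint8_t zeros[256] = {0};
  for (size_t i = 0; i < N; ++i)
    rebased[i] = (uint8_t)(idx[i] - N);  /* [N, 2N) maps to [0, N); lanes with idx < N are never consumed */
  lut1_fallback(tmp, t1, rebased, zeros, N);  /* second half of the LUT */
  lut1_fallback(out, t0, idx, tmp, N);        /* first half, falling back to tmp */
}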
There is a potential issue with those LUT operations with vec.v8 if the width can be larger than 2048 bits: a single byte would not be enough to address lanes from the whole vector (number of lanes > 256).
- Is this model enough for most use cases?
The devil is in the details - very basic algorithms would work with even less, but if we want to port code that already works with existing vector instruction sets we need to come up with something compatible. What sort of algorithms do you have in mind? Also, there are some examples in #5.
I know that's the tricky bit... Here is a small list of algorithms that definitely need to be implementable:
- gemm (matrix multiplication)
- reductions (not necessarily additions)
- Stencils
- prefix sums (1D)
- Integral images (like prefix sums but in 2D)
- interpolations
- On-The-Fly narrowing or widening (ie: chaining conversions with computation instead of having a separate conversion step)
I am also personally interested in more algorithms:
- Run-Length Encoding
- Union-Find (graph processing)
- Background subtraction (like Sigma-Delta)
- Is this model efficiently implementable on "legacy" architectures (namely SSE, and Neon)?
Masks are tricky on those architectures, however the "set length" approach is tricky as well (that's why it is now optional). Originally I thought of implementing set length by zeroing out the left-out portion of the vector, but that would still be quite expensive.
I think the best way would be to have a way for the engine to optimize away the mask when it is known to be full (or empty).
A vec_loop construct that would be semantically equivalent to a regular loop incrementing by vec.vX.length and with a vec.mX.index_lt could help, as the engine can see that only the last iteration will have a non-full mask.
The engine could monomorphize all the functions taking a mask to specialize the case where the mask is full.
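To illustrate the intended shape, here is a scalar reference-semantics sketch of such a loop (my own; VLEN and the per-lane loop are stand-ins for the runtime-constant lane count and a single masked vector operation, not proposal syntax). The index_lt-style mask is full on every iteration except possibly the last, which is exactly what the engine could exploit.

#include <stdint.h>
#include <stddef.h>

enum { VLEN = 8 };  /* stand-in for the runtime-constant lane count */

void add_arrays_masked(int32_t *a, const int32_t *b, const int32_t *c, size_t n) {
  for (size_t i = 0; i < n; i += VLEN) {
    /* mask = index_lt(i, n): lane l is active iff i + l < n */
    for (size_t l = 0; l < VLEN; ++l) {
      if (i + l < n)                 /* inactive lanes are neither read nor written */
        a[i + l] = b[i + l] + c[i + l];
    }
  }
}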
However, I don't know how it will play with your optional set_length. That's why I made it very explicit that length is actually a runtime constant (I don't want a true set_length).
Sure, I see your point - setting length might mean setting mask length, or something like that, but that is not very practical.
I think set_length would not be necessary as vec.mX.index_lt generates a mask that has the same semantics as set_length.
There is a potential issue with those LUT operations with vec.v8 if the width can be larger than 2048: a single byte would not be enough to address lanes from the whole vector (number of lanes > 256).
Right. SVE is limited to 2048 bits anyway, but this could bite on RVV. FYI there seems to be a tendency there to impose at least a 64K element upper bound (previously there was none). I suspect that just wrapping around is the best we can do.
@jan-wassenberg
I suspect that just wrapping around is the best we can do.
Out-of-bound policy is completely orthogonal to maximal vector length. I think for LUTs (ie: shuffles), out-of-bounds as 0 (or fallback) is really useful to chain LUTs when the actual LUT is larger than a single vector.
For iota(), wraparound seems much more useful than 0 - it would still allow differentiating odd and even lanes. For LUT2, could that not be implemented independently of overflow using compare+select/blend?
For iota(), wraparound seems much more useful than 0 - it would still allow differentiating odd and even lanes.
Most operations on lanes use the wraparound policy, namely: extract_lane, replace_lane, splat_lane, load_lane, store_lane, lane_shift.
In fact, only LUT uses a different policy to make it simpler to chain them. Maybe we want both for LUTs.
For LUT2, could that not be implemented independently of overflow using compare+select/blend?
Yes, you can chain them with compare+select, but that could be less efficient on some architectures, especially Neon and SVE. And I think that LUT will often be used in chains because you usually don't know how many registers you need for the LUT. Also, emulating the wrap around is simpler (just a binary and) than emulating the out-of-bound zeros (compare+select).
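A per-lane reference sketch (mine, not proposal text) of why the two emulations differ in cost: wrap-around is a single AND per lane when the lane count is a power of two, while the zeroing policy needs a compare plus a select.

#include <stdint.h>
#include <stddef.h>

/* Wrap-around policy: one AND per lane (N must be a power of two). */
static void lut1_wrap(uint8_t *out, const uint8_t *table, const uint8_t *idx, size_t N) {
  for (size_t i = 0; i < N; ++i)
    out[i] = table[idx[i] & (N - 1)];          /* index modulo lane count */
}

/* Zeroing policy: compare + select per lane. */
static void lut1_zero(uint8_t *out, const uint8_t *table, const uint8_t *idx, size_t N) {
  for (size_t i = 0; i < N; ++i)
    out[i] = (idx[i] < N) ? table[idx[i]] : 0;
}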
Basically, does narrowing concat inputs, or interleave inputs? Does widening selects low/high half or even/odd half? This might not be an issue as there seem to be consensus on this subject across architectures (ie: narrowing concats and widening selects low/high).
I think there is consensus, at least for the most part, that narrowing concats the inputs and widening uses low/high.
I also had a chance to look at the proposal.
Thanks!
... let mask types having sizes and have a way to load/store them in their natural, architectural representation, or have a way to load/store them in a compact format (1 bit per element). The second would make the most sense as an abstraction layer, but could be slow on some architectures (Neon for sure, maybe SVE).
Not great for SVE either.
After thinking about it, I think we should provide multiple instructions to let the user decide how they want to store the masks:
- 1 bit per element,
- 1 bit per byte,
- as a vector
IMHO it would not be hard to implement with SVE, but it would be very inefficient because it would involve a round-trip from the predicate registers to the general-purpose ones and back. Taking the combination of vec.mX.index_first and extract_lane as an example, my expectation would be that compilers would pattern-match the combination and just generate a LASTB instruction (simplifications are possible in specific cases). From SVE's perspective, the ideal solution would be to have an operation that extracts an element based on a mask value, but I am not sure how well that maps to other architectures.
Extracting a value based on a mask value would be a nightmare on x86, as you would have to first extract the mask into a general-purpose register, then find the index of the first set bit (with tzcnt), then move the result back to a SIMD register, shuffle the input vector, and finally extract the first lane value (around 15 cycles of latency, if I'm not mistaken).
So what I proposed should somehow minimize the overhead across architectures. And yes, some pattern matching could make it more efficient.
Keep in mind that may require pattern-matching across basic blocks, which might prove a challenge, especially for simpler JIT compilers.
I know that pattern matching has limitations, and that's why the overhead without pattern matching should be minimal. Pattern matching should be the cherry on top of the cake.
Most of the time, it would be possible to duplicate the instructions to overcome the genericity issue (in that case, one instruction taking a scalar index, and one taking a mask), but I think the number of opcodes is rather limited, so I chose the one that I thought had less overhead overall.
To set the record straight - while this proposal (the repo) draws inspiration from vector instruction sets, the first goal is to extend WebAssembly support for hardware fixed-width SIMD beyond 128 bits. The motivation is simply hardware availability (mostly) on the consumer market. On the other hand, we obviously don't want to close the door to SVE and RISC-V, though the case for those is not strong yet.
Not to discourage the work done here, the challenge with adding another platform is that it increases the surface area of the proposal, necessitating more compromise-finding :) I haven't looked very closely at this discussion yet, still stuck in opcode generation. In order to cut down the attention spread, we can dedicate some time in the SIMD conference call (#37), if that works for both of you.
The motivation is simply hardware availability (mostly) on the consumer market. On the other hand, we obviously don't want to close the door to SVE and RISC-V, though the case for those is not strong yet.
Just to add a bit to the SVE market availability angle - Arm has already announced processor cores with SVE support that cover almost the full range of the A architecture profile; the latest announcement was 3 weeks ago and concerned the next generation mobile cores. While it is true that it is not possible to buy anything featuring them on the market right now, if the past provides any indication about the future, then we should expect actual products in at most a year.
In order to cut down the attention spread, we can dedicate some time in the SIMD conference call (#37), if that works for both of you.
It should work for me; however, I think that @lemaitre and I are mostly in agreement, and I have just been asking for clarifications on specific points.
Arm has already announced processor cores with SVE support that cover almost the full range of the A architecture profile; the latest announcement was 3 weeks ago and concerned the next generation mobile cores.
I have seen the SVE2 announcement, but missed this one :) Sure, this makes things different.
Hi,
I've done a little experiment to see what the effects are of using masked memory operations vs using a vector loop with a scalar remainder. I've used the most basic loop I could think of:
void test_loop(int32_t * __restrict__ a, int32_t *b, int32_t *c, unsigned N) {
  for (uint32_t i = 0; i < N; ++i) {
    a[i] = b[i] + c[i];
  }
}
And this is compiled, with clang-13 on Linux, using various pragmas to disable loop unrolling, set the vector width and enable/disable predication. The table below shows the relative execution speed of the loop with different pragmas.
Loop | i7-10610U | Cortex-A72 (Raspberry Pi) |
---|---|---|
scalar | 1.00 | 1.00 |
4xi32 vector | 7.72 | 2.63 |
8xi32 vector | 7.26 | 2.74 |
16xi32 vector | 3.99 | 2.80 |
4xi32 predicated vector | 2.20 | 0.74 |
8xi32 predicated vector | 3.41 | 0.67 |
16xi32 predicated vector | 3.52 | 0.68 |
I have no idea about AVX-2 and Skylake, but the super-linear results for 4xi32 and 8xi32 surprised me. What wasn't so surprising was the negative effect that predicated memory operations have on performance. It would appear that using the native masking operations roughly halves the throughput compared to the non-predicated version.
On the NEON based Cortex-A72, we don't have any native masking support, so performance is significantly worse than the scalar loop.
So I can't see how introducing masked memory operations into wasm is currently viable, when they're not supported well (or at all) on the majority of currently available hardware.
@sparker-arm I don't know what you measured, but I think you have put too many constraints on the compiler for your benchmark to be relevant. I don't get where you expect masks to be used in your code. The compiler should not generate masks instructions for the loop body because you have no conditionals. The only place it would make sense in your code would be in the epilog, but it is hard to measure the time of the epilog. If your loop is too long, the epilog time would be negligible; if it is too short, you will measure the time of your timer, not your code.
So I tried measuring the performance of masked stores on x86_64, but in the most complex case: ie, masked stores at every single iteration, and with random masks (not the special-case ones from an epilog).
The kernel measured looks like that:
const int N = 1024;
int A[N], B[N];

void scalar() {
  for (int i = 0; i < N; ++i) {
    int b = B[i];
    if (b) A[i] = b;
  }
}
The complete list of versions and their descriptions is available in this gist: https://gist.github.com/lemaitre/581f3416dc3abc9944d576d08c2a444b
Here are the results on a i9-7900X (time in cycles per point):
version | random | always | never |
---|---|---|---|
scalar_v0* | 2.04 | 2.04 | 2.05 |
scalar_v1 | 12.7 | 2.03 | 3.02 |
scalar_v2 | 2.04 | 2.04 | 2.06 |
SSE4_v0* | 0.539 | 0.531 | 0.539 |
SSE4_v1 | 12.5 | 1.54 | 1.95 |
SSE4_v2 | 2.13 | 2.11 | 2.13 |
SSE4_v3 | 7.84 | 2.29 | 2.29 |
SSE4_v4 | 7.99 | 1.27 | 1.03 |
AVX2_v0* | 0.309 | 0.281 | 0.289 |
AVX2_v1 | 0.299 | 0.281 | 0.291 |
AVX512_v0* | 0.156 | 0.148 | 0.139 |
AVX512_v1 | 0.139 | 0.145 | 0.117 |
Some caveats: the v0s (*) are only valid in a monothread context, and SSE4_v3 uses the maskmov instruction, which bypasses the cache and thus has a huge latency, but I only measured the throughput here. SSE4_v1 and v2 are plain old scalar emulation (so directly usable on ARM).
So basically, scalar emulation is on par with plain scalar (SSE4_v2), and possibly even faster if we know there is a predictable pattern (SSE4_v4). This shows that we can propose masked instructions as they are as fast as scalar even when emulated with scalar code. With hardware support, it is incredibly fast.
For prologs, this can be made even faster because the masks have a special shape and can be further optimized.
Hi @lemaitre Each loop was run over 100,000 elements and a minimum time was taken from 10 runs. I'm using predication to remove the epilogue, so all the memory operations in the loop are masked. I see this is one of the main reasons for having flexible vectors: so that wider vector units aren't penalised for smaller loops and also so we don't have to distribute multiple versions of a loop. Or do you envision a compiler producing different versions of each loop?
And pardon my ignorance, but why are you only looking at stores?
Each loop was run over 100,000 elements and a minimum time was taken from 10 runs.
Over such a large array on a simple computation like yours, you should have been limited by L3 cache bandwidth, not by masking.
I'm using predication to remove the epilogue, so all the memory operations in the loop are masked.
I don't think that is a viable way to generate efficient loops, as even on AVX512 (with native masking) the overhead of computing the mask every single loop iteration can become non-negligible (even though the masking itself is). What I envision is to duplicate the loop body for the epilog, and to make only the epilog use masks. This produces larger bytecode, but keeps its efficiency on every architecture. In such a situation, masking is more efficient than a scalar remainder, even with the scalar emulation for the masking operation.
So basically, the compiler would generate exactly 2 versions of the loop: without masks for provably full iterations, and with masks for the remainder. If we want, we could have specialized masks for more optimized epilog handling, but it is not even mandatory.
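A scalar sketch of that two-version lowering (again my own illustration: VLEN stands for the runtime-constant lane count and the per-lane loops model single vector instructions).

#include <stdint.h>
#include <stddef.h>

enum { VLEN = 8 };  /* stand-in for the runtime-constant lane count */

void add_arrays_split(int32_t *a, const int32_t *b, const int32_t *c, size_t n) {
  size_t i = 0;
  /* version 1: provably full iterations, no masks involved */
  for (; i + VLEN <= n; i += VLEN)
    for (size_t l = 0; l < VLEN; ++l)
      a[i + l] = b[i + l] + c[i + l];
  /* version 2: a single masked iteration for the remainder (lane l active iff i + l < n) */
  for (size_t l = 0; l < VLEN; ++l)
    if (i + l < n)
      a[i + l] = b[i + l] + c[i + l];
}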
And pardon my ignorance, but why are you only looking at stores?
Because that's the hardest part of masking. Masked loads could be done with a full load and in-register masking. For unaligned loads, it might be necessary to add a check to see if the full load crosses a page boundary and fall back to a slower load in that case (because inactive lanes should not trigger a segfault). We could even skip the check, and just trap the segfault.
The problem with stores is you should not write to inactive lanes, because another thread might write to them: the full-width store can introduce race-conditions. That's why my v0s are wrong.
If I make the masked store emulation as fast as scalar, I consider it a win, because I can write the rest of my kernel with SIMD without any extra overhead.
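Here is a sketch of what the page-crossing check could look like for a 4-lane 32-bit masked load (my own illustration; it assumes 4 KiB pages, at least one active lane, and that active lanes point into mapped memory).

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* If the 16-byte window does not cross a page boundary, it lies in the same
 * page as the active lanes, so a full-width load cannot fault; otherwise fall
 * back to loading the active lanes one by one. Inactive lanes are zeroed (_mz). */
static void masked_load_i32x4(int32_t out[4], const int32_t *p, const uint8_t active[4]) {
  uintptr_t offset_in_page = (uintptr_t)p & (PAGE_SIZE - 1);
  if (offset_in_page <= PAGE_SIZE - 4 * sizeof(int32_t)) {
    int32_t tmp[4];
    memcpy(tmp, p, sizeof tmp);                                   /* full-width load */
    for (int i = 0; i < 4; ++i) out[i] = active[i] ? tmp[i] : 0;
  } else {
    for (int i = 0; i < 4; ++i) out[i] = active[i] ? p[i] : 0;    /* slow path */
  }
}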
For unaligned loads, it might be necessary to add a check to see if the full-load might cross page boundary and fallback to slower load in that case (because inactive lanes should not trigger a segfault).
Just a quick note - in practice aren't almost all loads effectively unaligned ones? The alignment specified by the memarg is just a hint and I doubt that the Wasm compiler would be able to prove that the access is aligned that often.
in practice aren't almost all loads effectively unaligned ones?
I also think so. At least if we keep the hint semantics. Personally, I would much prefer to make it a constraint. But even so, the check that a load does not cross a page boundary should be cheap, especially because page boundaries will not be crossed often, and also the branching pattern should be mostly regular.
Sure, considering each branch in isolation, the branching pattern should be mostly regular, but the branch density might have negative interactions with branch prediction (which wouldn't be an issue with the scalar version). That could be compensated for by inserting no-op instructions at the expense of code size.
Sure, considering each branch in isolation, the branching pattern should be mostly regular, but the branch density might have negative interactions with branch prediction (which won't be an issue with the scalar version).
First, the naive (branchy) scalar version will eat up the same amount of "branch predictor space" as my hypothetical branch on page-boundary crossing (if the scalar is unrolled, scalar would even be worse in this regard). Then, I wrote a branchless masked store that is on par with the branchless conditional scalar store. We could have the exact same thing for loads.
Finally, I have already mentioned it multiple times, but we could trap segfaults, and check whether they come from a legitimate masked load or not. The probability that such a masked load actually segfaults on inactive lanes is so low that the overhead from the signal is not relevant.
Over such a large array on a simple computation like yours, you should have been limited by L3 cache bandwidth, not by masking.
Okay.... to make you happy, 1000 elements:
Loop | i7-10610U | Cortex-A72 (Raspberry Pi) |
---|---|---|
scalar | 1.00 | 1.00 |
4xi32 vector | 5.49 | 3.04 |
8xi32 vector | 4.97 | 3.27 |
16xi32 vector | 5.49 | 3.27 |
4xi32 predicated vector | 2.13 | 0.74 |
8xi32 predicated vector | 3.24 | 0.67 |
16xi32 predicated vector | 3.33 | 0.68 |
What I envision is to duplicate the loop body for the epilog, and to make only the epilog using masks.
This is reasonable, but you'll still penalize SSE and Neon for kernels with small loop iterations, creating a performance cliff on certain architectures.
For unaligned loads, it might be necessary to add a check to see if the full-load might cross page boundary and fallback to slower load
Right... and as mentioned, that's all of them?
the check that load does not cross page boundary should be cheap
Do you know of any compilers that do this currently?
we could trap segfaults, and check if it comes from a legitimate masked load or not.
This sounds like a bad design choice, a compiler shouldn't take valid code, make it invalid and then rely on all the runtimes to handle the fault. We (runtimes) would most likely just have to scalarize and the branch density will not be good.
Okay.... to make you happy, 1000 elements:
No, I did not make myself clear enough. My point was that such a simple kernel should be limited by memory bandwidth (or at least L3 bandwidth) for 12 MB of data; your benchmark may be flawed somewhere (or you have impossibly fast memory). But as you don't provide the code, I cannot tell you more.
This is reasonable, but you'll still penalize SSE and Neon for kernels with small loop iterations, creating a performance cliff on certain architectures.
How is it penalizing SSE and Neon? Can you write an epilog that is faster than using masked loads/stores? Scalar will be slower. And the larger the vector, the more expensive a scalar remainder will be. A vector remainder has no such problems, but is only possible using masked instructions. Also, flexible vectors are all about "creating" a performance cliff: we want to leverage all the power of our fast archs.
The real point is: not penalizing old archs just for the sake of making recent ones faster. But introducing masked instructions does not cripple SSE/Neon, because the alternative would be scalar, which is not faster. In actual code, scalar would even be slower because the whole loop would be scalar, while here, only the masked memory accesses would be.
For unaligned loads, it might be necessary to add a check to see if the full-load might cross page boundary and fallback to slower load
Right... and as mentioned, that's all of them?
You don't get it. If your scalar code has a branch for the conditional access, my solution does not introduce any extra ones (one branch in both cases). If your scalar code does not have a branch, fine, we can also implement masked load/store without any branch (SSE4_v2); it would be just a bit slower than the branchy one on predictable patterns, but still as fast as the scalar one (or even faster on more complex codes).
the check that load does not cross page boundary should be cheap
Do you know of any compilers that do this currently?
we could trap segfaults, and check if it comes from a legitimate masked load or not.
This sounds like a bad design choice, a compiler shouldn't take valid code, make it invalid and then rely on all the runtimes to handle the fault.
The compiler has nothing to do with all of that! The compiler will just generate the WASM bytecode that will contain masked memory instructions. Then, the WASM engine (ie: runtime) will convert this bytecode into actual machine code, and this is where the page-boundary check and/or the signal trap would be integrated. As it is done by the runtime, the runtime itself can be sure the trap handler will be there.
We (runtimes) would most likely just have to scalarize and the branch density will not be good.
I don't get your point. First, it is possible to implement masked load/store without any branch (have you checked my code?), even if it is a "scalar emulation". Then, the signal approach does not introduce any branch either. Only when a load is at the edge of a system memory allocation would it be possible to trigger the fault. It is very unlikely to happen in real code, and thus this case can be slow, and we would not care.
The only paper I've seen on this approach is this, and it certainly shows potential. If you have some more links I'd be interested in having a look. But for all your assertions about this approach being faster, I'd like to see some examples of a compiler choosing to vectorize, using masked loads and stores, for an architecture that doesn't support them. Having an existing production system able to make sensible choices would appease me completely; otherwise it seems like a leap of faith in what we think a wasm compiler will be able to do. And if a wasm compiler can't use this stuff effectively, then it will just be disabled - and nobody wants that!
How is it penalizing SSE and Neon? Can you write an epilog that is faster than using masked loads/stores? Scalar will be slower.
Treat my example as an epilogue: scalar is faster on my raspberry pi using the current magic in LLVM. I think the problem here is that we're thinking about masking from two different angles: I think you're mainly considering removing control-flow and I'm considering how to represent a vector loop in the 'flexible' manner. I'd suggest that your primary use case is a separate issue from enabling wider vector widths and is, instead, the next step in the Wasm SIMD saga.
But for all your assertions about this approach being faster, I'd like to see some examples of a compiler choosing to vectorize, using masked load and stores, for an architecture that doesn't support them
I don't care about what compilers are doing now. I care about what we can do. And we can do much better in that regard. You want me to prove that masked remainder is faster than scalar? Fine, here it is: https://gist.github.com/lemaitre/a3510fa5d1d3cace46adc56be73e07ab
The kernel looks like this:
void func() {
  for (int i = 0; i < N; ++i) {
    int a = A[i];
    for (int j = 0; j < a; ++j) {
      B[j] += 1;
    }
  }
}
I have 5 versions:
- scalar_v0: scalar code with vectorization disabled
- scalar_v1: scalar code and vectorization forced
- sse4_v0: simd code with scalar remainder
- sse4_v1: simd code with branchless masked remainder (optimized masks)
- sse4_v2: simd code without remainder (lower bound of the processing time)
The number of iterations in the inner loop is small and random. The range is given in the header of the table:
version | [0, 4] | [0, 8] | [0, 16] | [0, 32] |
---|---|---|---|---|
scalar_v0 | 20.8 | 26.6 | 35.8 | 49.8 |
scalar_v1 | 22.0 | 31.4 | 39.6 | 45.3 |
sse4_v0 | 22.6 | 33.5 | 42.2 | 48.0 |
sse4_v1 | 9.57 | 18.6 | 30.2 | 34.5 |
sse4_v2* | 7.97 | 14.3 | 23.3 | 27.5 |
We can clearly see that the scalar remainder is slow on tiny loops, while my branchless masked remainder does a great job and is quite close to the no-remainder code. The most impressive aspect of this is that the loop body is ridiculously tiny (a load, a plus, and a store), so this is where the cost of the masking will be the most visible. But it still outperforms the scalar remainder nonetheless. I think a branchy masked remainder can be even more effective as long as there is not much pressure on the branch predictor.
And if a wasm compiler can't use this stuff effectively, then it will just be disabled - and nobody wants that!
I agree with that. And my results show that it can be effective. Ok, for now, it is hand-coded, but if you look at the code, you will see that automatic generation of such masked memory accesses will be easy.
Treat my example as an epilogue, scalar is faster on my raspberry pi using the current magic in LLVM.
I don't consider your example at all, as your results show there is something weird with your implementation, yet you do not provide it for me to check why.
I think the problem here is that we're thinking about masking from two different angles, I think you're mainly considering removing control-flow and I'm considering how to represent a vector loop in the 'flexible' manner.
My stance is that masks are needed anyway, so we can just use them to implement the remainder. I showed that, performance-wise, it is nice, even though the masks are emulated.
Oops, closed by accident.
So I can't see how introducing masked memory operations into wasm is currently viable, when they're not supported well (or at all) on the majority of currently available hardware.
@sparker-arm I am not sure what point you are trying to make. Is it that AVX2 (which your CPU has) doesn't have predicates? Or that there are challenges with trying to emit them in autovectorized programs?
There are obvious differences in predicate support between different ISA generations and different architectures, but I am confused about how your example is supposed to demonstrate those.
void test_loop(int32_t * __restrict__ a, int32_t *b, int32_t *c, unsigned N) {
  for (uint32_t i = 0; i < N; ++i) {
    a[i] = b[i] + c[i];
  }
}
And this is compiled, with clang-13 on Linux, using various pragmas to disable loop unrolling, set the vector width and enable/disable predication. Below shows the relative execution speed of the loop with different pragmas.
Specifically, I am not quite sure what you are observing when you add loop pragmas to a scalar A + B loop (in addition to what @lemaitre already said about the remainder vs the rest and the memory hierarchy). I think the pragmas you are using could potentially interact with one another. The fact that simple vectorization runs much slower than scalar is a bit of a giveaway:
Loop | i7-10610U | Cortex-A72 (Raspberry Pi) |
---|---|---|
scalar | 1.00 | 1.00 |
4xi32 vector | 7.72 | 2.63 |
8xi32 vector | 7.26 | 2.74 |
16xi32 vector | 3.99 | 2.80 |
4xi32 predicated vector | 2.20 | 0.74 |
8xi32 predicated vector | 3.41 | 0.67 |
16xi32 predicated vector | 3.52 | 0.68 |
(this a copy of the original table from https://github.com/WebAssembly/flexible-vectors/pull/27#issuecomment-965524787, before reduction in size, to avoid scrolling up and down)
If the effect of vectorization were a 3 or 8 times drop in performance, as in this data, then nobody would really be using it :) Autovectorization is a multi-stage process, where first the compiler has to find enough parallelism, then select the right operations, and so on, all of it interacting with other transformations. By adding extra pragmas you are likely disrupting it somewhere. Two more symptoms are that supposedly predicated code is actually running on a non-predicated instruction set, and that increasing the vector size beyond what the platform supports yields better performance on x86. In a way, the comparison between predicated and non-predicated vector code running times runs counter to the point you seem to be making about mask support, but I have doubts about this methodology in general.
If you want this to be a bit more thorough, it would be good to list assembly for each kernel paired with clang commands and full sources with the pragmas (in a gist or a separate repo).
Sorry @penzn, the table shows the relative performance / speedup so hopefully that makes a bit more sense now! This is how I'm using the pragmas, to force the vectorizer into the various settings, and these work when targeting aarch64 or x64.
__attribute__((noinline)) void test_predicated_vector_256(int32_t * __restrict__ a, int32_t *b, int32_t *c, unsigned N) {
#pragma clang loop vectorize(enable)
#pragma clang loop vectorize_width(8)
#pragma clang loop vectorize_predicate(enable)
#pragma clang loop unroll(disable)
  for (uint32_t i = 0; i < N; ++i) {
    a[i] = b[i] + c[i];
  }
}
@lemaitre I'm still trying to get my head around your benchmarks, it's a lot of code for someone who doesn't know x86! I don't think I understand how your examples are valid in the wider context of wasm though: you can write branchless remainder code for x86, whereas we cannot for aarch64. In aarch32/thumb-2 this would be okay, but I'm not sure what other architectures support conditional load/stores.
I'm also thinking about the feasibility of using traps to handle 'bad' vector loads. For an OoB load, how would we (runtimes) determine that the load is actually 'okay'?
@sparker-arm
you can write branchless remainder code for x86, whereas we cannot for aarch64
Even though cmov in x86 can be used for a conditional load, it is actually kind of useless as it does not prevent faults if the address is invalid and the condition is not met. The trick I used is to use cmov to select either the address in the array or a dummy address on the stack, and then use an unconditional load/store. This can also be implemented on ARM using the CSEL instruction. In fact, we could even compute the addresses (a bit) faster using SIMD.
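A minimal sketch of that select-address trick for a single lane (assuming the compiler lowers the ternary to cmov/CSEL rather than a branch):

#include <stdint.h>

/* Branchless conditional store: inactive lanes write to a scratch slot on the
 * stack instead of the real address, so there is no branch and no write to
 * memory another thread could observe. */
static inline void masked_store_lane(int32_t *p, int active, int32_t value) {
  int32_t scratch;                            /* dummy target for inactive lanes */
  int32_t *target = active ? p : &scratch;    /* expected to become cmov / CSEL */
  *target = value;                            /* unconditional store */
}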
Now, the code you generated with clang does work, but is horrendous: https://gcc.godbolt.org/z/zs38eMveK No wonder why it is slow.
EDIT: The x86 version looks better, but it falls back to scalar operations, not masked vectors: https://gcc.godbolt.org/z/79brKWoxz
The thing you seem to have missed is that the loop body does not need predication, only the remainder. This is a very simple way to make the predication cost negligible.
In fact, that's what my code tries to benchmark. The inner loop is small and has a remainder (implemented either in scalar, or with predicated load/store), and the outer loop just repeats the inner one with a random number of iterations. That way, I can measure the time taken by the loop while the remainder is not negligible. Note that my code is explicit about how to do the remainder.
Of course, the method implies that the loop body is duplicated, but I think it is a "necessary evil" to have efficient loops. And just to put things in perspective, the code generated from your code contains the loop body four times (if I counted correctly).
I'm also thinking about the feasibility of using traps to handle 'bad' vector loads. For an OoB load, how would we (runtimes) determine that the load is actually 'okay'?
We would need a list of masked memory loads. If the current instruction address is in the list, it was a masked load, and a more robust masked load should be used. In order to keep the list small, we could make this list per function. Or we could add some special no-ops before or after the load that the signal handler could check to see if the load was a masked one (and where the mask is). Finally, because WASM has only a single memory region, we could reserve an extra page at both ends of the region to ensure a masked load with at least one valid lane would never fault.