designs Initial Draft for Vector SIMD Codegen Enhancements

trafficstars

This PR describes two proposals meant to enhance SIMD codegen in RyuJIT:

Enhance the capability of Vector<T> to serve as either a template to generate multiple SIMD ISA pathways or dynamically recompile its chosen SIMD ISA at runtime via performance guided optimization: https://github.com/anthonycanino/designs/blob/main/accepted/2022/enhance-vector-codegen.md
Introduce additional functionality to Vector<T> through new abstractions (VectorMask) and 512-bit vectors: https://github.com/anthonycanino/designs/blob/main/accepted/2022/enable-512-vectors.md

Looking forward to feedback and discussion on the ideas.

Jul 07 '22 22:07 anthonycanino

CC. @JulieLeeMSFT, @jeffhandley

Also CC. @davidwrighton for CG2/R2R considerations and @AndyAyersMS for PGO considerations

CC. @BruceForstall and @dakersnar who are part of the working group

Jul 12 '22 19:07 tannergooding

cc @dotnet/jit-contrib.

Jul 12 '22 20:07 JulieLeeMSFT

Hi,

Arm person here... and I apologize now for really not knowing anything about .net, so please pardon my ignorance!

I really like the sound of focusing on the generic vector API, it seems most architectures are at least considering old-school vectors...

I've been having a look through the existing Vector APIs and it looks like Vector<T> is only partially implemented for hardware acceleration, or am I misunderstanding? If I've not misunderstood, is the long-term plan to migrate the fixed APIs into the generic one and leave this as the only public API? I'm assuming, if the idea is to program AVX512 using the generic one, it would be advantageous to have fallback paths to more narrow extensions, before scalarizing. And are there plans for implementing 'weird' stuff like shuffle, permute and gather/scatters, etc... for Vector<T>? (Sorry if I've missed them)

I'm not sure I follow why you'd need all off the *Leading and *Trailing operations, would three not suffice? Something like, MaskedLoad, MaskedStore and GenerateRemainderMask? And, if the Vector<T> API is being extended add a mask argument, are there not existing API calls which could be used for masked load/store operations?

Continuing on the theme of predication, what is the defined behavior for 'false' lanes, can it be programmed? I'm thinking in the context of SVE, where predication can enable zeroing or merging lanes. Does AVX512 enable something similar?

Aug 12 '22 11:08 sparker-arm

So, what happens if the compiler decides that sometimes we should use AVX-512 for some operations, but AVX2 for others? Is there enough type information to enable this with Vector<T>? Java has a stronger type system for their vectors, would there be any blockers for why it couldn't be done here?

I also have concerns that this approach will not be performant for SVE in an AOT setting, as everything depends on being able eventually have a compile-time fixed size vector. It's possible to set the width SVE at runtime, but I understand this is only possible at the kernel level, and we can't depend on all platforms providing hooks to do this. The other option would be to use predication to fix the width in the generated code, but this is likely to produce slower code.

Flexible (sizeless) vectors could be coming to WebAssembly and we've (hopefully) made the necessary changes in cranelift to enable this for both AVX and SVE. That approach is really based on the idea of using WebAssembly's 128-bit vectors with the addition of a scaling factor to produce a target-defined vector type. This there any way that dotnet could handle a notion of a scaling factor, that could be compile- or runtime-defined?

Oct 06 '22 10:10 sparker-arm

I've been having a look through the existing Vector APIs and it looks like Vector<T> is only partially implemented for hardware acceleration, or am I misunderstanding? If I've not misunderstood, is the long-term plan to migrate the fixed APIs into the generic one and leave this as the only public API? I'm assuming, if the idea is to program AVX512 using the generic one, it would be advantageous to have fallback paths to more narrow extensions, before scalarizing. And are there plans for implementing 'weird' stuff like shuffle, permute and gather/scatters, etc... for Vector<T>? (Sorry if I've missed them)

Vector<T> is an agnostic API that provides a software fallback for when acceleration isn't available. When Vector.IsHardwareAccelerated reports true then the majority of functions are expected to be accelerated and sufficiently (but maybe not "optimally") performant. There may still be cases, such as integer division, where the implementation has to fallback to a purely software-based approach.

I'm not sure I follow why you'd need all off the *Leading and *Trailing operations, would three not suffice? Something like, MaskedLoad, MaskedStore and GenerateRemainderMask? And, if the Vector<T> API is being extended add a mask argument, are there not existing API calls which could be used for masked load/store operations?

I'd typically expect a MaskedLoad/MaskedStore to read a full vector from the given address and then mask off the upper bits. There can be complications and extra expense that arise from this approach including in dealing with whether the masked bytes should cause an AccessViolation if they cross a page boundary and the masked bytes aren't available for reading.

On the other hand, a Leading or Trailing API can take advantage of the data alignment and count to more trivially do something like "backtrack Size - remainder bytes", "load/store", "shift/shuffle data to the correct element position". This ends up making it easier to deal with the explicit concept of leading or trailing element processing in a way that allows the compiler to make it more efficient.

Continuing on the theme of predication, what is the defined behavior for 'false' lanes, can it be programmed? I'm thinking in the context of SVE, where predication can enable zeroing or merging lanes. Does AVX512 enable something similar?

Yes. AVX512 has a concept of both "merge-masking" and "zero-masking". We would likely have functionality exposed in the generic APIs to enable the same.

So, what happens if the compiler decides that sometimes we should use AVX-512 for some operations, but AVX2 for others? Is there enough type information to enable this with Vector<T>?

This would likely end up PGO or data driven in some other fashion. It would default to "Max" (or "ReasonableDefault") otherwise. The types by themselves could never provide enough data.

Java has a stronger type system for their vectors, would there be any blockers for why it couldn't be done here?

While we could enable something like VectorShape, we've found that such functionality overall makes the UX overall more complicated as compared to concrete types.

Users are able to trivially query for type support, hardware acceleration, and have it treated as a JIT time constant so that they can detect and choose the based fit explicitly whether that be a specific sized vector or an agnostic vector that "grows" to the maximum capabilities of the platform.

I also have concerns that this approach will not be performant for SVE in an AOT setting, as everything depends on being able eventually have a compile-time fixed size vector. It's possible to set the width SVE at runtime, but I understand this is only possible at the kernel level, and we can't depend on all platforms providing hooks to do this. The other option would be to use predication to fix the width in the generated code, but this is likely to produce slower code.

Yes. AOT in general is a problematic consideration for such code. In C/C++ SVE puts restrictions on how the underlying "vector" type can be used. For example, it disallows its usage as a field, in sizeof expressions, and a few other scenarios. Vector<T> is a pre-existing type where devs have already used it in such scenarios and that potentially complicates a few things and we will need to consider how to best ensure that can be supported.

Flexible (sizeless) vectors could be coming to WebAssembly and we've (hopefully) made the necessary changes in cranelift to enable this for both AVX and SVE. That approach is really based on the idea of using WebAssembly's 128-bit vectors with the addition of a scaling factor to produce a target-defined vector type. This there any way that dotnet could handle a notion of a scaling factor, that could be compile- or runtime-defined?

Simply scaling up a n * V128<T> ops at a time sounds somewhat similar to how Vector256<T> is now implemented as 2x Vector128<T> ops and how Vector256<T> will be implemented as 2x Vector256<T> ops (and therefore 4x V128<T> ops). However, this comes with many considerations around usability and perf, especially where pipelining comes into play and where issuing too many of a same operation can limit CPU throughput or even hurt perf long term.

I think source generators or an approach similar to generic math with an interface + generic specialization would be better for providing performant and "size agnostic" implementations for a given platform.

Oct 06 '22 16:10 tannergooding

Fair enough wrt to masking vs leading/trailing, thanks for the clarification.

This would likely end up PGO or data driven in some other fashion. It would default to "Max" (or "ReasonableDefault") otherwise. The types by themselves could never provide enough data.

So, IIUC, that means that any path in the API explicitly needs to handle and treat any Vector<T> as a single specific type? For instance, when AVX-512 is supported but an operation isn't supported natively, then the fallback path still needs to treat Vector<T> as a 512-bit vector?

Users are able to trivially query for type support, hardware acceleration, and have it treated as a JIT time constant so that they can detect and choose the based fit explicitly

Related to my query above, it sounds like relying on user-defined logic would be error prone, whereas using stronger types should allow the compiler to detect bugs.

Simply scaling up a n * V128<T> ops at a time sounds somewhat similar to how Vector256<T> is now implemented Just to be clear, the approach wasn't about cracking larger vectors into chunks, as that somewhat defeats the point, but instead it allows a target to communicate it's register width (just with the assumption that it would be a multiple of 128-bits).

Oct 07 '22 08:10 sparker-arm

For instance, when AVX-512 is supported but an operation isn't supported natively, then the fallback path still needs to treat Vector<T> as a 512-bit vector?

Correct. Vector<T> currently has a "fixed-size" for the lifetime of the process. It's possible we could change it to be "fixed-sized" per method instead, but that would likely require significantly more work and would ultimately be dependent on something like PGO to help determine what the "correct" size is.

Related to my query above, it sounds like relying on user-defined logic would be error prone, whereas using stronger types should allow the compiler to detect bugs.

Could you clarify on what you mean by "stronger" types here? There are multiple ways this could be interpreted and I'd like to ensure we're on the same page.

Some examples on what you believe might be error prone would be good as well.

Related to my query above, it sounds like relying on user-defined logic would be error prone, whereas using stronger types should allow the compiler to detect bugs.

How is this different from Vector<T>.Count which communicates how many T is processed by the vector? If you needed to know the exact width of the vector, you can simply Vector<T>.Count * sizeof(T) * 8 to get the bit-width.

Oct 07 '22 15:10 tannergooding

Could you clarify on what you mean by "stronger" types here? How is this different from Vector<T>.Count which communicates how many T is processed by the vector?

I mean enough information so that a compiler would pick up any type mismatches, and I'm mainly thinking in the context of using different vector sizes.

Is a user able to use Vector<T> along with fixed vectors? If so, what are the mechanisms to help prevent them from doing so incorrectly? Does the compiler use Vector<T>.Count to reason a relationship between it and a fixed type?

Narrowing/widening operations are maybe a good example of where not having a single size for Vector<T> could be useful. From what I've seen, Vector.Narrow takes two operands to combine into a full-width result and it isn't possible to support a single vector input.

To help my understanding, how would this currently get vectorized with Vector<T>? Sorry for the C...

void mul(short *a, short *b, int *c, int N) {
  for (int i = 0; i < N; ++i) {
    c[i] = (int)a[i] * (int)b[i];
  }
}

Oct 19 '22 11:10 sparker-arm

I'd typically expect a MaskedLoad/MaskedStoreto read a full vector from the given address and then mask off the upper bits. There can be complications and extra expense that arise from this approach including in dealing with whether the masked bytes should cause anAccessViolation` if they cross a page boundary and the masked bytes aren't available for reading.

This is not the case for SVE though. In SVE the direction is reversed. You are given a mask to use by the control flow managing instructions such as whilelo. From this predicated loads only load where lanes are active.

The only time you have any cross page file issues are when you have an unknown number of iterations, so e.g. a

while (true)
{
...
}

But for that we have first faulting loads.

On the other hand, a Leading or Trailing API can take advantage of the data alignment and count to more trivially do something like "backtrack Size - remainder bytes", "load/store", "shift/shuffle data to the correct element position". This ends up making it easier to deal with the explicit concept of leading or trailing element processing in a way that allows the compiler to make it more efficient.

The problem is that for SVE none of this is needed at all. At an instruction level something like a whilelo already does the training loop calculations and also produces the mask. https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/WHILELO--While-incrementing-unsigned-scalar-lower-than-scalar-

That's why I think the comment made at https://github.com/dotnet/designs/pull/268#discussion_r921137007 is actually quite a salient one.

I agree with that this PR is mostly about building blocks, but I'd hope that no actual code is written directly with these blocks but only a higher level API. In particular manually managing trailing loops will always be suboptimal for SVE.

and in that sense

int SumVector(ReadOnlySpan<int> source)
{
    Vector<int> vresult = Vector<int>.Zero;

    foreach (Vector<int> slice in source.SliceAsVector())
    {
        vresult += slice;
    }

    return vresult.Sum();
}

is a better API for both SVE and AVX I think. The expansion of the foreach can allow for more efficient codegen for both.

That said I personally hope more for something like this

int unsafe SumVector(int* source, int n)
{
    Vector<int> vresult = Vector<int>.Zero;

    foreach (auto iter in new VectorSource (source, n))
    {
        Vector<int> val = Vector<int>.Load (source, iter);
        vresult.Add (val, iter);
    }

    return vresult.Sum();
}

where iter will have methods such as .GetCurrentMask().

Nov 08 '22 14:11 TamarChristinaArm

We'll ultimately need something that works reasonably well across a range of hardware. That includes x64, Arm64, Wasm, and future platforms. Such support needs to account for both older hardware (pre SVE, pre AVX512) and modern hardware (SVE+, AVX512+). There will likely be a number of gaps, on all platforms, where the JIT needs to be smarter to generate better code and understand where operations can be simplified or alternatively represented.

For the case where developers want the utmost fine-grained control, we'll have the platform specific APIs available so that developers get raw access and can fully take advantage of the functionality available.

is a better API for both SVE and AVX I think.

In the sense of a single "micro-kernel", yes. Once you start getting into more complex algorithms, it ends up worse off.

Consider an algorithm that needs to Lerp and Sum. If you have 1 Lerp and 1 Sum API, then you get code that is generally "more efficient" while still being extremely suboptimal. This is particularly if you need to store intermediate results as you start needing to access and walk n times as much memory.

Effectively any logic that ends up outside the "simple operation" path is in a similar boat, where you ultimately need/want some customized logic to better account for the sequence of operations you need.

That said I personally hope more for something like this

This ends up being "more expensive" for the "core" body of the loop because it will constantly have to check if Load is going to pass the ends. The JIT could be smarter here and rewrite the loop, but that can also get quite expensive/complex.

It is likely simpler and cheaper to just have the main loop where Load reads a "full vector" and then a LoadTrailing which takes in a mask and therefore could take in the "predicate" on the SVE side to indicate how much data is left to be read. This also works well on AVX-512 where a similar mask exists and on older hardware where the mask can be used with 1-2 extra instructions to efficiently read the right amount of data without violating access boundaries.

Nov 08 '22 15:11 tannergooding

We'll ultimately need something that works reasonably well across a range of hardware. That includes x64, Arm64, Wasm, and future platforms. Such support needs to account for both older hardware (pre SVE, pre AVX512) and modern hardware (SVE+, AVX512+).

Agreed, so the expectation is that you expect users to still write ISA specific loops? The current Vector128 generic functions work well enough on all ISAs yes. I'm however not entirely convinced the definition "Well enough" on fully masked ISAs and not/partially masked ones can't mean the same.

This is particularly if you need to store intermediate results as you start needing to access and walk n times as much memory.

I'm not sure I follow you here.. The only thing I was suggesting is that all APIs should have a fully masked version, or a way for the JIT to recognize the loop mask. With an iterator the loop's governing mask is clear, so at least you can mask the operations appropriately. With an explicit counter you have to pass a true predicate to every SVE call. We don't for instance, have unmasked loads.

This ends up being "more expensive" for the "core" body of the loop because it will constantly have to check if Load is going to pass the ends. The JIT could be smarter here and rewrite the loop, but that can also get quite expensive/complex.

I'm guessing you mean here for non-fully masked ISAs. But I don't follow why? I'm guessing this has to do with how IEnumerator is lowered? I would have expected to be able to transform the iterator into a naïve counted loop for ISAs that don't fully support predication, that's also why the example gave the number of elements. I'm guessing you can't do this because the semantics of the iterator have to be preserved in case the iterator escapes the loop? (genuine question).

In which case, you can do the same without the iterator by instead using a custom class with a while loop?

I'd have expected the same number of comparisons as you would normally for the same for loop. I was also expecting that for these ISA the jit could simple peel the loop for the "remainder". But perhaps that not something that can be done?

Nov 08 '22 16:11 TamarChristinaArm

designs designs copied to clipboard

Initial Draft for Vector SIMD Codegen Enhancements

designs
designs copied to clipboard