
[RFC] Adding initial SVE implementation

Open MeeraN7 opened this issue 4 years ago • 34 comments

Introducing the addition of the Arm architecture's Scalable Vector Extension (SVE) in TVM, containing an initial vector-length-agnostic (VLA) and predication implementation, based on earlier work by Giuseppe Rossini.

@mbaret

MeeraN7 avatar Aug 04 '21 16:08 MeeraN7

I think it is a good idea to invite people working on RISC-V support for TVM to review/discuss, since the RISC-V vector extension is similar to Arm SVE. I remember some people in Taiwan are working on this. Maybe @comaniac knows who?

masahi avatar Aug 04 '21 23:08 masahi

I think it is a good idea to invite people working on RISC-V support for TVM to review/discuss, since the RISC-V vector extension is similar to Arm SVE. I remember some people in Taiwan are working on this. Maybe @comaniac knows who?

Thanks for bringing this up. cc @yrchen

comaniac avatar Aug 04 '21 23:08 comaniac

@sjoerdmeijer @giuseros

MeeraN7 avatar Aug 05 '21 13:08 MeeraN7

Thank you @MeeraN7 for the RFC. SVE is certainly an interesting topic.

Because we do not yet have SVE support in TIR, it would be useful to think carefully about how SVE can be represented and transformed. Right now the proposal contains one way to do so. However, we could use a bit more context to see how the proposed design impacts the general lowering flow.

In particular, it would be very helpful to also show examples of the code along the transformations, so we can better understand the possible design tradeoffs. It might be helpful to translate some of the examples in the whitepaper to TVMScript form.

Specifically:

  • The TIR before SVE vectorization
  • The TIR after SVE vectorization right before LLVM codegen
  • The corresponding lowered llvm intrinsics
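
For instance, the first item could look roughly like the following (a hypothetical sketch assuming a recent TVMScript syntax; this example is not taken from the RFC itself):

  import tvm
  from tvm.script import tir as T

  @T.prim_func
  def before_vectorize(A: T.Buffer((130,), "int8"),
                       B: T.Buffer((130,), "int8"),
                       C: T.Buffer((130,), "int8")):
      # a plain serial loop, before any SVE-specific transformation
      for i in T.serial(130):
          C[i] = A[i] + B[i]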

To touch a bit on the design alternatives (disclaimer: I only learnt VLA/SVE by quickly reading through the manual, so I could be wrong): based on my understanding, the main goal of VLA is to use intrinsics to represent a somewhat restricted loop pattern (in terms of the operations we can perform), while the previous fixed-length vectorization pushes those operations onto vector constructs such as Ramp and Broadcast.

I wonder if we could take a different approach for VLA/SVE, because VLA is by nature closer to the loop pattern. Specifically, I feel we could come up with some form of "loop SVE legalization" that legalizes a loop's body to the patterns that SVE supports, then leaves the for loop as it is with a VLA annotation. The code generator can then take that and generate VLA code.

tqchen avatar Aug 05 '21 15:08 tqchen

cc @junrushao1994 @vinx13 as it can be related to the future tensorization optimizations

tqchen avatar Aug 05 '21 15:08 tqchen

Hi @tqchen, I will try to comment sporadically, since this is a project I prototyped (and enjoyed :) ) when I was at Arm.

If I understand your comment correctly, what @MeeraN7 is doing is closer to what you are proposing. Instead of transforming a loop into a Ramp and passing the Ramp "as is" to LLVM (which is what is done for fixed vector lengths, but is not doable for SVE), @MeeraN7 is legalising the loop in TIR and passing the legal loop down to the LLVM code generator. In other words, the following loop:

for i = 1:1:10
A[i] = B[i]+C[i];
end

Once legalised becomes

for i = 1:VL:10
A[i : i+VL] = B[i : i+VL] + C[i : i+VL]
end

And then the LLVM code generator, knowing that this is a variable-length loop, translates it using LLVM intrinsics for:

  • Predicated load/store
  • Loop increment
  • Predication mask calculation

Please note that only load/store needs to be predicated. Other register-to-register operations (e.g., add/sub/mul) won't need predication.

Please also note that while predication is not a TIR concept, we use it in the LLVM codegen to support VLA (cc @tkonolige). In the future it should be quite straightforward to also expose predication in TIR (if required).
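
As an illustration, here is a NumPy emulation of that legalised loop (a sketch only; VL is fixed here purely for emulation, whereas in real SVE it is unknown at compile time):

  import numpy as np

  def vla_add(A, B, VL=4):
      n = len(A)
      C = np.zeros(n, dtype=A.dtype)
      i = 0
      while i < n:
          lanes = np.arange(i, i + VL)
          mask = lanes < n                    # predication mask calculation
          idx = np.minimum(lanes, n - 1)      # keep masked-off lanes in bounds
          a = np.where(mask, A[idx], 0)       # predicated load
          b = np.where(mask, B[idx], 0)       # predicated load
          c = a + b                           # register-to-register add: no predication needed
          C[lanes[mask]] = c[mask]            # predicated store
          i += VL                             # loop increment by VL
      return C

  assert np.array_equal(vla_add(np.arange(10), np.arange(10)), 2 * np.arange(10))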

@MeeraN7 feel free to jump in if something I said is not correct (very likely :) )

giuseros avatar Aug 05 '21 22:08 giuseros

Thank you @giuseros for your answer, it looked good to me. Just to add on: we know which loops need to be legalised because in TIR we add a new ForKind called kVectorizedScalable (cc @tkonolige), which marks the loop as vectorizable while the value of VL is unknown. These loops are then legalised and transformed to include the unknown value VL, and later passed to the code generator as @giuseros mentioned.

MeeraN7 avatar Aug 06 '21 10:08 MeeraN7

Thank you all for the comments so far, we are currently working on a response to address them.

MeeraN7 avatar Aug 06 '21 10:08 MeeraN7

Thanks @MeeraN7. I think the overall loop legalization makes sense. I wonder then if it is necessary to update node constructs such as Ramp, or whether we can directly make use of a scalar loop in the body.

tqchen avatar Aug 06 '21 13:08 tqchen

Thanks @MeeraN7 @giuseros, I like the approach of making the vectorized loop explicit with a VL parameter at the TIR level, in contrast to how fixed-width vectorization is done today. It would be great if you could add a section on why this alternative implementation strategy was chosen. Perhaps to make the LLVM codegen simpler?

If possible, I think it is better not to introduce user facing changes, since as far as an API is concerned, s[C].vectorize(...) is already vector-length agnostic.

masahi avatar Aug 27 '21 07:08 masahi

@masahi, about:

If possible, I think it is better not to introduce user facing changes, since as far as an API is concerned, s[C].vectorize(...) is already vector-length agnostic.

which I think is very closely related to an earlier inline comment:

As I commented above, I'd like to continue using s[C].vectorize(...) and when the feature is available, enable SVE by a target attribute. So I don't expect any user facing work.

I think we do need a user-facing option to toggle fixed/scalable vectorisation. If the vectorisation strategy is selected based on an available target attribute, we lose the ability to choose between fixed and scalable vectorisation. For architectures that do support scalable vectorisation, fixed width might still be preferable in some cases.

I think this is similar to Clang's loop pragmas. For example, the vectorize_width pragma has been extended with an optional second argument fixed|scalable:

vectorize_width(value[, fixed|scalable]),

see also the Clang docs here.

So I see two approaches:

  • we extend s[C].vectorize(...) to take an optional fixed/scalable boolean value, similar to Clang's loop pragma, which defaults to fixed if omitted,
  • or introduce s[C].vectorize_scalable(...) as proposed in this RFC.

I personally don't have any preference. But now I am wondering if extending s[C].vectorize(...), the first option, would be better. What do you think?
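
To make the two options concrete, here is a sketch at the TE schedule level (the surrounding TE calls are real API, but both scalable spellings are hypothetical and therefore shown commented out):

  import tvm
  from tvm import te

  n = te.var("n")
  A = te.placeholder((n,), name="A")
  B = te.placeholder((n,), name="B")
  C = te.compute((n,), lambda i: A[i] + B[i], name="C")
  s = te.create_schedule(C.op)

  # Option 1: extend the existing primitive (defaults to fixed if omitted)
  # s[C].vectorize(C.op.axis[0], scalable=True)   # hypothetical
  # Option 2: a dedicated primitive, as currently proposed in the RFC
  # s[C].vectorize_scalable(C.op.axis[0])         # hypothetical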

sjoerdmeijer avatar Sep 01 '21 12:09 sjoerdmeijer

Thanks @sjoerdmeijer @giuseros, I didn't imagine that there would be a case where mixing fixed and scalable vectorization is beneficial. I prefer s[C].vectorize(..., scalable=True) to s[C].vectorize_scalable(...) but both seem fine.

Any other comments @tqchen @tkonolige?

masahi avatar Sep 01 '21 22:09 masahi

Thanks @MeeraN7 @giuseros, to make the discussion more concrete, right now the IR after legalization looks like

  for (i: int32, 0, 17;i+=VL) {
    C_2[ramp(i, 1, VL)] = ((int8xVL*)A_2[ramp(i, 1, VL)] + (int8xVL*)B_2[ramp(i, 1, VL)])
  }

This would require changes to constructs such as the Ramp data structure and the data type to support VL vector types, which can be a bit ad hoc, because there is also additional information that needs to be encoded (e.g. that this VL and that VL are the same) but is nevertheless not clearly encoded here.

Given that we are mostly matching a for pattern, I also wonder if these changes are really necessary, since we could represent a VLA loop with some form of restricted for loop carrying special annotations. Here is a possible alternative way to do so:

  for (i: int32, 0, 17;i, annotation={"VLA"}) {
    C_2[i] = A_2[i] + B_2[i];
  }

And we would defer the vectorized instruction generation to the codegen phase, by specially handling the patterns in the for loop that is annotated as VLA. Of course we can only support a limited set of patterns (such as read/write at the same vector index, or limited reduction support), which is why legalization is needed to make sure the body of the VLA for loop satisfies the pattern.

In this way we can likely get a similar set of capabilities without hacking Ramp to take a VL size.

tqchen avatar Sep 10 '21 01:09 tqchen

Hi @tqchen, thank you for the comment. To clarify a few things, VL is not added to the Ramp Node at all, it is simply a string that is used when printing TIR for visual representation. The only addition to the Ramp Node (and also Broadcast Node) is a boolean called "is_scalable" which should not affect anything else as a separate constructor was added. I don't think any other information needs to be added to these nodes or data types.

MeeraN7 avatar Sep 10 '21 11:09 MeeraN7

Thanks @MeeraN7 . Yes I get what you mean. Right now we are adding a "is_scalable" field to indicate that the broadcast and ramp are "context dependent" on VL. Additionally, we might need to update DataType to indicate a scalable data type.

This context dependency is the missing information I mentioned here. The set of code is really undefined and should be parameterized by VL. Additionally, considering the case of two loops with different VL1 and VL2 where we want to do some transformations, we might fall into the trap of thinking of them as the same type (because only "is_scalable" is marked) when in reality they are not, as an implicit dependency on VL can be ignored.

I can understand that the additional flag can be helpful, as we could reuse some of the vectorization logic. However, the "is_scalable" field might introduce additional confusion as above, and the additional Ramp node may not carry much additional information (apart from the fact that we use a scalar vs a vector type). So my main question is whether or not we could use a separate normal form to hint the code generator without changing the current DataType, Ramp and Broadcast.

Specifically, a regular loop as follows would carry the same amount of information: an access at i (the VLA index) would indicate a vector load, and accesses at other indices would become normal loads. The main constraint could be that we only allow i to appear in certain locations (say, in the innermost position, to represent a ramp-like pattern), and defer the generation of SVE code to the codegen phase, by pattern matching the indices and looking up intermediate values:

N1: A possible loop normal form via annotation

  for (i: int32, 0, 17;i, annotation={"VLA"}) {
    C_2[i] = A_2[i] + B_2[i];
  } 

N1 would indeed push a bit more pressure onto the code generator, because the code generator now needs to pattern-match loads/stores at the VLA index (i), and possibly perform broadcasts where necessary. However, the additional overhead may not be too large, and this could help us keep the code cleaner. My guess is that this approach is also easier to generalize to later loop patterns such as scalable matrix instructions, in which case we cannot really reuse Ramp and Broadcast.
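
For example, the kind of pattern matching the codegen would need could be sketched with TVM's Python TIR utilities roughly as follows (a sketch under the N1 assumption, not existing codegen code):

  from tvm import tir

  def classify_loads(loop: tir.For):
      """Split loads in a VLA-annotated loop into vector vs. scalar candidates."""
      vector_loads, scalar_loads = [], []

      def visit(node):
          if isinstance(node, tir.BufferLoad):
              # An index that is exactly the VLA loop variable maps to a
              # contiguous (ramp-like) vector load; anything else stays
              # scalar and may need a broadcast.
              if any(idx.same_as(loop.loop_var) for idx in node.indices):
                  vector_loads.append(node)
              else:
                  scalar_loads.append(node)

      tir.stmt_functor.post_order_visit(loop.body, visit)
      return vector_loads, scalar_loads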

tqchen avatar Sep 10 '21 13:09 tqchen

Thanks for commenting @tqchen Could you further clarify a few things for me please? See remarks inlined.

Thanks @MeeraN7 . Yes I get what you mean. Right now we are adding a "is_scalable" field to indicate that the broadcast and ramp are "context dependent" on VL. Additionally, we might need to update DataType to indicate a scalable data type.

This context dependency is the missing information I mentioned here.

I don't think I understand what you mean by context dependency. Basically, in my view of the world, the Ramp node means that we can process elements in a data-parallel way. How exactly this is done is up to the backends, depending on the architecture and the code. What we are doing here is annotating this Ramp node with a bit of state, as a hint that we want a special kind of vectorisation. From this point of view it is syntactic sugar: if the hint is dropped or ignored, the loop could still be vectorised, just not in a vector-length-agnostic way.

I don't think we need to encode much more information than one boolean that specifies the vectorisation style. You could indeed argue that this is ad hoc, but if we were to do it a different way, we would still need to keep a bit of state around, like the annotation={"VLA"} example that you gave earlier. From that point of view, I don't see any difference.

The set of code is really undefined and should be parameterized by VL.

What do you mean by undefined here?

Additionally, considering the case of two loops with different VL1 and VL2 where we want to do some transformations, we might fall into the trap of thinking of them as the same type (because only "is_scalable" is marked) when in reality they are not, as an implicit dependency on VL can be ignored.

I can understand that the additional flag can be helpful, as we could reuse some of the vectorization logic. However, the "is_scalable" field might introduce additional confusion as above, and the additional Ramp node may not carry much additional information (apart from the fact that we use a scalar vs a vector type). So my main question is whether or not we could use a separate normal form to hint the code generator without changing the current DataType, Ramp and Broadcast.

Correct me if I am wrong, but your assumption is that explicit scalable state is bad. Looking at your example:

N1: A possible loop normal form via annotation

  for (i: int32, 0, 17;i, annotation={"VLA"}) {
    C_2[i] = A_2[i] + B_2[i];
  } 

This annotation looks equivalent to a loop pragma. In Clang, for example, you can do:

  #pragma clang loop vectorize_width(4, scalable)
  for (int i = 0; i < 17; ++i) {
    C_2[i] = A_2[i] + B_2[i];
  }

and thus request scalable vectorisation. If this annotation results in scalable vectorisation, the LLVM vectoriser will lower this to operations using <vscale x 4 x i32> scalable vector IR types.

What I would like to say with these examples is that there are two things going on, at different levels. I think:

  • The annotation={"VLA"} corresponds to a loop pragma in Clang,
  • The TIR Ramp node extension corresponds to LLVM IR scalable types, e.g. <vscale x 4 x i32>.

And I think these two concepts are different things that each have their own value and place.

If we can only annotate loops as being scalable, we lose the finer-grained control to request this at the statement level. I don't know if mixing fixed and scalable will be an important use case, but I think it is possible.

Summarising: I don't think explicit encoding of scalability in TIR nodes is a bad thing; quite the opposite, actually, I think we need it, and the annotation on the loop might be a complementary technique to it.

What do you think?

sjoerdmeijer avatar Sep 12 '21 07:09 sjoerdmeijer

Thanks @sjoerdmeijer, sorry for getting back to this late. If LLVM also encodes the SVE vector as a special type (without tying down the n), that could be a precedent we can learn from. I do want to point out that the same approach won't work when it comes to scalable matrix instructions, so it would be good to think about a loop-pattern-based approach from that angle.

If we conclude that we really want to introduce the scalable vector into the data type and so on, a remaining thing we need to resolve is the encoding. DataType and DLDataType are part of the DLPack standard, used as a standard ABI convention, so ideally we do not want to change the type signature of DataType/DLDataType. What we might be able to do is introduce a magic lane number (say DataType::kScalableVectorLaneMark = -1) that indicates a scalable vector type, and use similar approaches for other language constructs as well. This would alleviate the problems from that side.

This being said. It would still be good to get @MeeraN7 @giuseros @sjoerdmeijer 's take on the possibility of using a normalized loop pattern vs embedding into the vector.

Thank you for all the discussions so far!

tqchen avatar Sep 29 '21 19:09 tqchen

Hi @tqchen , thanks for the great feedback!

This being said. It would still be good to get @MeeraN7 @giuseros @sjoerdmeijer 's take on the possibility of using a normalized loop pattern vs embedding into the vector.

Yes, I can see that as being very useful. If my analogy is correct, the loop annotation corresponds to a loop pragma, and the scalable Ramp/vector node to an IR extension. Like I wrote before, I think there's a place for both of these concepts.

Can you please advise how to proceed? I guess we need to document this in the RFC. But do we want to prototype this too? Or do we just list it as an alternative, and discuss here what we would like to do first?

I have not yet commented on your remark about the ABI convention, as I am not familiar enough with TVM to comment on it. It looks, though, like that is a solvable problem.

sjoerdmeijer avatar Oct 04 '21 07:10 sjoerdmeijer

The ABI issue is important since it affects DLPack, so I would suggest we agree on the encoding convention before we proceed. I am fine with having a scalable vector representation if we can settle on the right encoding.

tqchen avatar Oct 04 '21 12:10 tqchen

DataType and DLDataType are part of the DLPack standard,

The DataType type is defined in TVM; it's not present in DLPack. Or am I missing something?

kparzysz-quic avatar Oct 11 '21 17:10 kparzysz-quic

@kparzysz-quic DataType is a thin wrapper around DLDataType.

tqchen avatar Oct 11 '21 17:10 tqchen

Right, but users of DLPack will not see DataType, only DLDataType, unless they use TVM as well. Changing DataType will only affect those that use include/tvm/runtime/data_type.h (i.e. the ABI breakage will be limited to users of TVM).

kparzysz-quic avatar Oct 11 '21 17:10 kparzysz-quic

@kparzysz-quic this is right. On the other hand, we still rely on conversion between DLDataType and DataType (e.g. getting a DataType from an ndarray) in many cases, so ideally we want to keep them consistent.

tqchen avatar Oct 11 '21 20:10 tqchen

A belated thank you for sharing further thoughts on this, @tqchen and @kparzysz-quic. I am on holiday this week (and a little bit of next), but I want to pick this up soon after that. In the meantime, some remarks/questions from my side.

First of all, the ABI discussion goes over my head at the moment, to be honest; I am not yet familiar enough to comment on or address it. My understanding from the discussion so far is that there will be an ABI break, but a rather limited one, so it doesn't require any further attention here. Would that be a correct summary?

Perhaps more fundamental is @kparzysz-quic 's remark:

This RFC is not about SVE. It's about vectorizing (countable) loops with bounds not known at compile time. I think that would be very valuable for all targets and we can determine the list of necessary features that a target should satisfy to implement it.

Even though SVE can be used to implement it, it only applies to ARM, and there is no compelling reason to introduce SVE-specific features to TIR. There may be room for unknown-length vectors in TIR, but that should be done separately from SVE.

Yep, I agree, and you're right about this: this is not really about SVE. The "only" addition to TIR is the is_scalable part, which I would not consider SVE-specific, because it just represents an unknown-length vector, and any backend can deal with that in any way it likes (and quite a few architectures now support scalable-like vectors). So, just for my understanding, would you mind elaborating on your thoughts here? For example, is it just a naming issue and we need to find another term, or is it more fundamental to how we would like to do things in TIR? Thanks!

smeijer1234 avatar Oct 27 '21 09:10 smeijer1234

@smeijer1234 to be more specific: I was recommending an approach that won't break the ABI.

Introduce

DataType::kScalableVectorLaneMark = -1

And in DataType, introduce a method IsScalable(), which returns true if the lanes field equals -1. This way we reuse the data layout of the original data structure without introducing an is_scalable field.
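
A minimal Python mock of this encoding (a sketch, not TVM code) shows that the struct layout stays untouched because the existing lanes field is reused:

  class MockDataType:
      """Mimics TVM's DataType, reusing `lanes` to mark scalability."""
      kScalableVectorLaneMark = -1

      def __init__(self, code, bits, lanes):
          self.code, self.bits, self.lanes = code, bits, lanes

      def is_scalable(self):
          return self.lanes == MockDataType.kScalableVectorLaneMark

  fixed = MockDataType(code=0, bits=8, lanes=16)     # int8x16
  scalable = MockDataType(code=0, bits=8, lanes=-1)  # int8xVL
  assert not fixed.is_scalable() and scalable.is_scalable()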

tqchen avatar Oct 27 '21 14:10 tqchen

Hi @tqchen @kparzysz-quic @masahi @tkonolige @smeijer1234 ,

We are looking to revive this work. I have gone through the thread. The summary so far is as follows:

  • We want to introduce/enhance a vectorization scheduling primitive that could be controlled by the user/auto-tuner/auto-scheduler to use scalable vectors in the backend codegen.

    • The conversation has resolved to extending the existing vectorize scheduling primitive, i.e. s[C].vectorize(..., scalable=True)
  • Usage of this scheduling primitive should result in a for loop with Ramp nodes carrying either an additional "is_scalable" argument or a special number for lanes.

    • I think @tqchen was suggesting using the special lane number (-1), as opposed to introducing an additional argument to TIR nodes such as Ramp and Broadcast as well as DataType (and DLDataType), to avoid ABI breakage.
    • Moreover, VectorizeLoopScalable will be modified to create a While node.
  • The name of the RFC is confusing? (@kparzysz-quic) I suppose that, for TIR, what we are adding is vector-length-agnostic vectorization support, while demonstrating the codegen of VLA-vectorized TIR using Arm(R) SVE instructions.

Please confirm whether this is a correct summary of the current state. As for next steps, I would like to propose/resolve each of the outstanding discussion points and update the RFC.

manupak avatar Jun 21 '22 17:06 manupak

@manupa-arm i think your summary is roughly correct.

@tqchen @masahi @kparzysz-quic perhaps you guys could have a look and see if we could resolve the question of how to represent the scalable concept in DLDataType. also cc @alter-xp in case they would like to leverage this for RISC-V (i'm aware you guys are doing a BYOC impl, but perhaps this could be useful if the approach goes beyond that in the future).

i do tend to agree with @kparzysz-quic that we could make the title a bit more generic since the DLDataType question is at the center of this RFC.

areusch avatar Jun 30 '22 20:06 areusch

Thanks for bringing this up. This is also very useful for RISC-V; we look forward to progress in this regard.

alter-xp avatar Jul 01 '22 03:07 alter-xp

To reiterate: my original concern was that the first RFC was proposing changes to the target-independent part of TVM to add support for a very target-specific feature. However, I do think that we can move this forward in a way that would be overall useful.

Here is the outline of my thoughts on this. Let me know what you think.

First, a couple of observations:

  1. Architectures that support vectors can be assumed to also support vector predication. I'm talking specifically about masked operations, and in particular about predicated loads and stores.
  2. For ARM/AArch64, it may be beneficial to distinguish vectorization via fixed-length vectors from one via scalable vectors. If this choice is to be made by auto-scheduling, it should be expressible in TIR.

What this RFC proposes is very close to allowing vectorization of countable loops with a variable iteration count, and I insist that we keep this in mind as a goal.

The way that vectorization works right now is that a loop like

for (i : [0, 130)) {
  C[i] = A[i] + B[i]
  D[i] = A[i] * B[i]
}

will be replaced with statements

C[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] + B[Ramp(0, 1, 130)]
D[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] * B[Ramp(0, 1, 130)]

The expressions within these statements are all PrimExpr, whose type must be expressible by DataType. All parameters in DataType are compile-time integers, which means that a single statement can only represent vectors with a known number of lanes. In other words, neither VIC (variable iteration count) nor VLA can be implemented without some changes. These changes may be in how types are represented in DataType, or in how vectorization is done (or a combination of the two).

We are already considering a special value for DataType::lanes that would represent the yet-unknown vector length (VL). Following Halide's approach to vectorization, I propose that we change vectorization to take an explicit vector length as a parameter. As a special case for SVE, the scalable VL could be represented by the same constant we chose for DataType::lanes. For compatibility with existing code, stage.vectorize() would be equivalent to stage.vectorize(vector_length=iter_count), since currently only loops with known iteration count can be vectorized. The argument value vector_length=VL would indicate using SVE. With vectorize(vector_length=32), the loop above would be turned into

for (i : [0, (130+31)/32)) {
  // i-th vector is [32*i..32*(i+1))
  C[Ramp(32*i, 1, 32), pred=(Ramp(32*i, 1, 32) < Broadcast(130, 32))] = A[Ramp..., pred=...] + ...
  ...
}

If the loop iteration count changed from the known integer 130 to some expression N, the generated code would remain mostly the same: the structure does not depend on the fact that 130 is a compile-time constant. Similarly, the 32 indicating the vector length could be replaced with the predefined value for "scalable vector length", with the only potential issue being calculating the iteration count of the for loop above. If we were to allow an explicit "stride" in For, the issue would go away (the RFC proposes something like that).
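
A quick arithmetic check of the example above, in plain Python (a sketch): with vector_length=32 and 130 elements, only the last of the (130+31)/32 = 5 iterations has a non-trivial predicate.

  iter_count = (130 + 31) // 32              # 5 iterations
  for i in range(iter_count):
      lanes = range(32 * i, 32 * (i + 1))
      pred = [l < 130 for l in lanes]        # Ramp(32*i, 1, 32) < Broadcast(130, 32)
      print(i, sum(pred), "active lanes")
  # iterations 0..3: all 32 lanes active; iteration 4: only 130 - 128 = 2 active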

To summarize:

  1. Introduce kScalableVectorLaneMark (as suggested by @tqchen).
  2. Make vector length a parameter to stage.vectorize.
  3. Introduce "predicate" to BufferLoad and BufferStore.
  4. Allow non-unit strides in For loops (as per the RFC).

kparzysz-quic avatar Jul 01 '22 18:07 kparzysz-quic

Hi~ here are my two questions :) cc @kparzysz-quic

  • 2. Make vector length a parameter to stage.vectorize.

    What is the difference between

    • sch[C].vectorize(v, vector_length=32) and
    • vo, vi = sch[C].split(v, 32) then sch[C].vectorize(vi)

    It seems that we could also choose to properly lower the split's predicate to reach the same goal as proposed below. For example, the mechanisms introduced in RFC https://github.com/apache/tvm-rfcs/pull/77 may help?

  • 3. Introduce "predicate" to BufferLoad and BufferStore.

    Our team also got confused about how to represent predicated ld/st when, several months ago, upstream upgraded T.load/T.store (which had a 1D predicate field) to BufferLoad/BufferStore. Now that BufferLoad/BufferStore are multi-dimensional, does the predicate also become a multi-dimensional field?

    Another concern is whether embedding a predicate into BufferLoad/BufferStore increases the complexity of (or breaks) buffer-region-related analysis in existing implementations. Could we leverage T.select(pred, A[...], undef) to represent A[..., pred], or just match a predicated memory access pattern like if (pred) C[...] = ...? See the sketch of the latter below.
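
    For the second alternative, a hypothetical TVMScript sketch of the if (pred) pattern could look like this (assuming recent TVMScript syntax; the predicate is expressed as ordinary control flow guarding the store):

      import tvm
      from tvm.script import tir as T

      @T.prim_func
      def guarded_copy(A: T.Buffer((160,), "float32"),
                       C: T.Buffer((160,), "float32"),
                       n: T.int32):
          for i in T.serial(160):
              if i < n:           # the would-be predicate
                  C[i] = A[i]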

Thanks!

wrongtest-intellif avatar Jul 03 '22 03:07 wrongtest-intellif