
Vectorisation

Open sv2518 opened this issue 3 years ago • 20 comments

I would like to merge TJ's work on the introduction of a C vector target in order to enable vectorisation. PyOP2 and Firedrake pass all tests with this branch. If possible I would like to merge this into master, since it should not depend on the callables structure introduced in the firedrake branch.

sv2518 avatar Jul 23 '20 12:07 sv2518

Hi Sophia, thanks for the PR. Could you please point out what additional features c_vec provides that are not already present in the vec iname tag? IIUC, the advantage of the c_vec tag over vec is that it emits cleaner code by making use of language extensions provided by GCC/Clang.

kaushikcfd avatar Aug 17 '20 16:08 kaushikcfd

In my understanding, vec iname tags are only available for the OpenCL/CUDA targets; am I mistaken? The c_vec tag allows the use of C vector extensions.

sv2518 avatar Aug 19 '20 07:08 sv2518

Could you explain how c_vec differs semantically from vec?

Off the cuff, rather than introduce a new tag, I think I would prefer realizing the existing vec tag the way you realize c_vec now in the target you're adding.

inducer avatar Aug 19 '20 23:08 inducer

I had a look at where the code generation differs depending on the tags.

  1. Depending on the tag (CVectorizeTag or VectorizeTag), sequential or vectorised loop code is generated here https://github.com/sv2518/loopy/blob/cda1d551c7b607915d9d6aaea9d4edef132d13a2/loopy/codegen/control.py#L123
  2. The access info to arrays differs between the dim_tags (VectorArrayDimTag or CVectorArrayDimTag) here https://github.com/sv2518/loopy/blob/cda1d551c7b607915d9d6aaea9d4edef132d13a2/loopy/kernel/array.py#L1350
  3. and here https://github.com/sv2518/loopy/blob/cda1d551c7b607915d9d6aaea9d4edef132d13a2/loopy/kernel/array.py#L441 the code generation also differs between the dim_tags, but I don't quite understand yet what that function is doing.

I have asked @tj-sun to join the discussion, he is the owner of this code and can probably explain better why he decided to do it this way.

sv2518 avatar Aug 20 '20 08:08 sv2518

OIC, thanks for those links. In that case, I'm not sure how cvec (as an array axis tag) is different from a normal (strided) array axis. And I don't really understand what CVectorizeTag (as an iname tag) does. Could you explain?

inducer avatar Aug 20 '20 21:08 inducer

Hi,

It's been a while since I last looked at this, so I might end up saying something stupid. But the idea of c_vec is indeed an abstraction of vector extensions in C, i.e. to support use cases such as:

typedef double double4 __attribute__ ((vector_size (32)));
double4 a, b, c;
// use as vector
c = a + b;
// but can also access individual elements
for (int i = 0; i < 4; i++)
  c[i] = a[i] + b[i];

I'm not sure if we can achieve this by specialising vec for CTarget, but I seem to recall that I tried and ran into some problems with vec.

The iname tag was introduced mostly because there are some special cases where such vectors can't be used, e.g.

c = sin(b);  // won't compile: sin() has no vector overload
c = 0.0;     // won't compile: a scalar can't be assigned to a vector

Let me know what you think.

tj-sun avatar Aug 20 '20 22:08 tj-sun

Thanks @tj-sun for your help! I'm still of a mind to address the problems you encountered (and allow a unified "vocabulary" across targets) rather than introduce a target-specific widget, not least because I can't think of a first-principles reason that would prevent this. To that end, do you have a vague sense of what the issues were on either end?

inducer avatar Aug 20 '20 22:08 inducer

In my understanding you want to keep the target; you only request using the same tag as for vectorisation on other targets, is that right? I will go ahead and try to make this work. To me it looks like the new tags were only introduced to select different code paths for vectorisation depending on the target. I think we should be able to use VectorizeTag and simply make the code path conditional on the kernel target. So for example for the snippet controlling the generation of the loop code here https://github.com/sv2518/loopy/blob/cda1d551c7b607915d9d6aaea9d4edef132d13a2/loopy/codegen/control.py#L123, instead of the following

        if filter_iname_tags_by_type(tags, (UnrollTag, UnrolledIlpTag)):
            func = generate_unroll_loop
        elif filter_iname_tags_by_type(tags, VectorizeTag):
            func = generate_vectorize_loop
        elif filter_iname_tags_by_type(tags, CVectorizeTag):
            func = generate_sequential_loop_dim_code

we would do something like

        if filter_iname_tags_by_type(tags, (UnrollTag, UnrolledIlpTag)):
            func = generate_unroll_loop
        elif filter_iname_tags_by_type(tags, VectorizeTag):
            if isinstance(kernel.target, CVecTarget):
                 func = generate_sequential_loop_dim_code
            else:
                 func = generate_vectorize_loop

sv2518 avatar Aug 21 '20 07:08 sv2518

That sounds like a reasonable approach.

I remember that one thing to watch out for is the codegen of indexing into a c_vec here: https://github.com/sv2518/loopy/blob/cda1d551c7b607915d9d6aaea9d4edef132d13a2/loopy/target/c/codegen/expression.py#L962

The index is put into the subscripts, and it should just print as a 1D array: a[0]. I can't remember what happens when using vec in CTarget, but you probably need a slightly different mechanism, as vec has a separate index field?

Also, the iname tag c_vec is an IlpBaseTag while vec is a HardwareConcurrentTag, which seems more restrictive.

tj-sun avatar Aug 21 '20 08:08 tj-sun

The codegen of the indexing there works on the access information to the array, if I'm reading it right. I would simply condition on the target in get_access_info as well; I don't think that is a problem.

@kaushikcfd / @inducer can you explain the difference between IlpBaseTag and HardwareConcurrentTag? Do you think it would be a problem to switch from IlpBaseTag to HardwareConcurrentTag for the iname tag on the CVectorTarget?

sv2518 avatar Aug 21 '20 09:08 sv2518

Okay, done. We now have a "unified vocabulary across targets" by using the VectorizeTag on CVecTarget as well. Let me know what you think. More feedback welcome.

sv2518 avatar Aug 25 '20 07:08 sv2518

An even simpler implementation is possible by handling the vector extensions similar to OpenCLTarget's support for vector types (like floatn, doublen). I have demonstrated it in the c_vecextensions_target branch.

I am curious to know if there's any feature in the implementation here that cannot be expressed through the implementation in the c_vecextensions_target branch?

Thanks.

kaushikcfd avatar Aug 26 '20 17:08 kaushikcfd

Okay, I am currently trying to understand the difference between our branches. Can I just ask a few questions?

  1. The way the vector extensions are defined in your branch is base_type __attribute__((vector_size(byte_count))), and you use that for every temporary declaration; we go via a typedef in the preamble: typedef base_type vector_type __attribute__ ((vector_size(byte_count)));. I think going via the typedef would be a bit neater, or is there any disadvantage to that?

Just a little question: you keep track of the new types of declarations, but why do you do it twice? You already set it as a class attribute here https://github.com/inducer/loopy/blob/f81f3cd373f46442554b6ed96aabe08355fec572/loopy/target/c_vector_extensions.py#L67 but then you also save it into a dict that is also stored on the class https://github.com/inducer/loopy/blob/f81f3cd373f46442554b6ed96aabe08355fec572/loopy/target/c_vector_extensions.py#L69

  2. Your VectorExtensionASTBuilder does not have the functions emit_sequential_loop and generate_temporary_decl as ours does; the former is used for the fallback option of SIMD pragmas, the latter to declare global constants.

We definitely need the former, in case expressions can't be vectorised with the help of the vector extensions. That's one of the big differences between our two branches.

And I think, instead of the latter, you made use of add_vector_access in the AST builder. The generated code for a declaration is different, however; for a 10-by-1 vector the code in your branch produces

for (int32_t i = 0; i <= 9; ++i)
  {
    temp1[0] = x[4 * i];
    temp1[1] = x[1 + 4 * i];
    temp1[2] = x[2 + 4 * i];
    temp1[3] = x[3 + 4 * i];
  }

In our branch this would look like

for (int i3 = 0; i3 <= 9; ++i3)
        temp1[i3] = vector_value;

where vector_value is of size 4, so an access to the second element of, say, the third batch would be t2[1][2]. So our storage format is transposed relative to yours; in your case the same access would be expressed in C as t2[2+4*1], is that right?

I think this difference in storage format is also why we need the ExpressionToCVecExpressionMapper and CVecExpressionToCodeMapper and you do not.

Would you be able to change the storage format and add the fallback OpenMP pragma option in your branch without ending up doing what we did?

  3. Another little question: what are your arguments against using a DtypeRegistryWrapper for the vector types? Surely it is nicer to encapsulate this in a specialised DtypeRegistryWrapper than to just call _register_vector_types?
  4. Another major difference is that rather than generate_sequential_loop_dim_code, you go via generate_vectorize_loop here https://github.com/sv2518/loopy/blob/e2684a5f5062fb5f7bfb5c5f092ceb3092c98e5b/loopy/codegen/control.py#L125. I don't quite understand yet how the code generation of these two differs; maybe someone else can comment on this, otherwise I will take a closer look.

Besides the fallback option of OpenMP SIMD pragmas (which your branch does not contain) and the difference in storage format, another big difference is the preprocessing work we do in realize_c_vec before generating the code. Some expressions need transformations before the compiler is willing to vectorise them. One example: if you have a constant right-hand side, we need to add a vector-valued zero to the expression.

sv2518 avatar Aug 31 '20 12:08 sv2518

And another thing: how many elements of one type you can fit into one vector depends on the vector extensions available on your machine. We pass this information through from PyOP2; I think you don't consider this in your branch.

sv2518 avatar Aug 31 '20 12:08 sv2518

Thanks for taking a look into the alternative implementation!

I think going over the typedef would be a bit neater, or is there any disadvantage to that?

True. That can be implemented by overriding the preamble generator of the target's AST builder and maintaining the seen_dtypes.

but then you also save it into a dict that is also in saved in the class..

Yep, that seems unnecessary and should be removed. I copied it over from loopy/target/opencl.py and didn't notice it.

Your VectorExtensionASTBuilder does not have the functions emit_sequential_loop and generate_temporary_decl as ours does, former is used for the fallback option of simd pragmas, latter is used to declare global constants.

In such cases, I would prefer if the user realizes that the instruction cannot be vectorized and tags the iname with the fallback tag. This way the code generation process is more transparent to the user.

The generated code for a declaration is different however, for a 10 by 1 vector the code in your branch produces

  • I think the indexing you provided is incorrect; in the first kernel it should be temp1[4*i3+1].
  • The assembly of both kernels would be similar, i.e. both would contain vmov operations.

How many data of one type you can fit into one vector depends on the vector extensions available on your machine. We pass this information through from PyOP2, I think you don't consider this in your branch.

This isn't needed, as it's inferred from the iname bounds of the loop being vectorized. This freedom is also allowed in the underlying target, as one can have types like double __attribute__((vector_size(16))) on processors with AVX support.

Some expressions need some transformations, before the compiler wants to vectorise them. One example is, if you have a constant right hand side, we need to add a vector valued 0 to the expression.

In this case the loop responsible for zeroing the array should be tagged with unr and not vec, and the compiler vectorizes it. I have compared the two approaches here.

Would you be able to change the storage format and add the fallback OpenMP pragma option in your branch without ending up doing what we did?

I think before answering this, we need to figure out whether we intend to use #pragma omp simd as the implicit fallback for vec tags. I would lean towards "no", but others can voice their opinions.

kaushikcfd avatar Aug 31 '20 17:08 kaushikcfd

True. That can be implemented by overriding the preamble generator of the target's AST builder and maintaining the seen_dtypes.

Okay, thanks! I will try to rewrite our preamble generation with consideration of your fix.

This isn't needed, as that's inferred from the iname bounds of the loop being vectorized. This freedom is also allowed in the underlying target as one can have types of double __attribute__((vector_size(16))) on processors with AVX support.

Aha, okay! I will try to clean this up a little then.

In such cases, I would prefer if the user realizes that the instruction cannot be vectorized and tags the iname with the fallback tag. This way the code generation process is more transparent to the user.

Okay, since PyOP2 is basically the user from your perspective, we would need to collect the inames that need to be retagged and provide feedback about it. How would you like to do that? loopy.generate_code_v2 is our trigger for the loopy process. Maybe I can collect all the inames here https://github.com/sv2518/loopy/blob/e2684a5f5062fb5f7bfb5c5f092ceb3092c98e5b/loopy/preprocess.py#L2195 and then raise an error, so that in PyOP2 we can say

try:
    loopy.generate_code_v2(...)
except CVecError:
    retag_from_vec_to_simd(kernel, to_be_retagged_inames)
    loopy.generate_code_v2(...)

I don't think this is ideal, however: it means we need to run through your pipeline twice, at least partially.

Maybe we can factor the check of whether the expressions are vectorisable with the vector extensions or SIMD pragmas into a function of its own, say def check_vectorisability(...), and invoke that before we call loopy.generate_code_v2(...).

Are there other options?

I think the indexing you provided is incorrect, in the first kernel it should be temp1[4*i3+1]

I actually copied the first code snippet from the kernel printed by your branch. The full kernel produced by your branch is

#include <stdint.h>
void loopy_kernel(double const *__restrict__ x, double *__restrict__ y)
{
  double __attribute__((vector_size(32))) temp1;
  double __attribute__((vector_size(32))) temp2;

  for (int32_t i = 0; i <= 9; ++i)
  {
    temp1[0] = x[4 * i];
    temp1[1] = x[1 + 4 * i];
    temp1[2] = x[2 + 4 * i];
    temp1[3] = x[3 + 4 * i];
    temp2 = 2.0 * temp1 + 1.0;
    y[4 * i] = temp2[0];
    y[1 + 4 * i] = temp2[1];
    y[2 + 4 * i] = temp2[2];
    y[3 + 4 * i] = temp2[3];
  }
}

That looks to me like your storage format is temp1[batch_number + batch_size * element_number]; our storage format is temp1[element_number][batch_number], so we don't use striding in this case (we would use striding in the first dimension if the considered data were in matrix form). I think this is important because the code generation in map_subscript differs, see here https://github.com/sv2518/loopy/blob/e2684a5f5062fb5f7bfb5c5f092ceb3092c98e5b/loopy/target/c/codegen/expression.py#L962 and here https://github.com/sv2518/loopy/blob/e2684a5f5062fb5f7bfb5c5f092ceb3092c98e5b/loopy/target/c/codegen/expression.py#L1009

In this case the loop responsible for zero-ing the array should be tagged with unr and not vec and the compiler vectorizes it. I have compared the 2 approaches here.

I had a look at that and it is slightly different from what I meant. I meant: if you have an expression vector_valued_lhs = vector_valued_constant;, then in the preprocessing we transform that to vector_valued_lhs = vector_valued_constant + vector_valued_zero;. Maybe it's the same story, however: do you think vector_valued_lhs = vector_valued_constant; would be vectorised by the compiler if retagged to unr? @tj-sun, did you try this at the time you wrote the code?

Thanks for your reviews so far! I hope we can sort this out rather quickly, so I can continue with my own work on batched BLAS. Let me know what you think about the ideas I had regarding the OpenMP SIMD pragmas. I will also chat to the Firedrake people about moving the retagging a level up to be controlled by PyOP2.

sv2518 avatar Sep 01 '20 08:09 sv2518

(about the implementation in PyOP2)

In PyOP2 the steps that would be needed (not very different from what's already implemented):

  • Split n into n_batch depending on the underlying arch.
  • Privatize the temporaries in n_batch loop
    • Tag the privatized array axes of the temporaries with vec dim tag
  • Tag n_batch with vec iname tag

In this case loopy would emit a warning saying that the gather/scatter instructions of the PyOP2 kernel weren't vectorized and would unroll n_batch for those instructions. (@tj-sun: Could you please confirm if this would capture all the desired vectorization opportunities?)

(about the discrepancy in the storage format)

temp is not a multi-dimensional array in the provided example. Trying an example with multi-dimensional array realized as a vector type:

knl = lp.make_kernel(
        "{[i, j1, j2, j3, k]: 0<=i<10 and 0<=j1,j2,j3<4 and 0<=k<2}",
        """
        <> temp1[k, j1] = x[i, k, j1]
        <> temp2[k, j2] = 2*temp1[k, j2] + 1 {inames=i:j2}
        y[i, k, j3] = temp2[k, j3]
        """,
        [lp.GlobalArg('x, y', shape=lp.auto, dtype=float)],
        seq_dependencies=True,
        lang_version=(2018, 2),
        target=lp.CVectorExtensionsTarget())

knl = lp.tag_inames(knl, 'j2:vec, j1:ilp, j3:ilp')
knl = lp.tag_array_axes(knl, 'temp1,temp2', 'c,vec')

Generated code:

void loopy_kernel(double const *__restrict__ x, double *__restrict__ y)
{
  double __attribute__((vector_size(32))) temp1[2];
  double __attribute__((vector_size(32))) temp2[2];

  for (int k = 0; k <= 1; ++k)
    for (int i = 0; i <= 9; ++i)
    {
      (temp1[k])[0] = x[8 * i + 4 * k];
      (temp1[k])[1] = x[1 + 8 * i + 4 * k];
      (temp1[k])[2] = x[2 + 8 * i + 4 * k];
      (temp1[k])[3] = x[3 + 8 * i + 4 * k];
      temp2[k] = 2.0 * temp1[k] + 1.0;
      y[8 * i + 4 * k] = (temp2[k])[0];
      y[1 + 8 * i + 4 * k] = (temp2[k])[1];
      y[2 + 8 * i + 4 * k] = (temp2[k])[2];
      y[3 + 8 * i + 4 * k] = (temp2[k])[3];
    }
}

kaushikcfd avatar Sep 01 '20 09:09 kaushikcfd

In PyOP2 the steps that would be needed (not very different from what's already implemented):

  • Split n into n_batch depending on the underlying arch.
  • Privatize the temporaries in n_batch loop
    • Tag the privatized array axes of the temporaries with vec dim tag
  • Tag n_batch with vec iname tag

In this case loopy would emit a warning saying that the gather/scatter instructions of the PyOP2 kernel weren't vectorized and would unroll n_batch for those instructions. (@tj-sun: Could you please confirm if this would capture all the desired vectorization opportunities?)

I think I don't understand this completely. I have two questions:

  1. In my understanding, you want to move even more than the retagging from VectorizeTag to OpenMPSIMDTag into PyOP2? The kernel passed to loopy from PyOP2 is actually a program. Going through your pipeline has the advantage that we end up in preprocess_single_kernel and can do our transformations there (on a single kernel), rather than walking through the kernels in the program ourselves.
  2. Does your proposal imply that we would drop the fallback option of OpenMP SIMD pragmas completely?

sv2518 avatar Sep 01 '20 10:09 sv2518

Okay, maybe I have got something working that fits what you said. I factored the checks of whether something is vectorizable with extensions, OpenMP pragmas or unrolling into check_cvec_vectorizability(...). It returns information about which instruction/iname pairs belong to which of the three categories. This information is then passed to cvec_retag_and_privatize, which is responsible for all the substitutions. The corresponding loopy branch is here https://github.com/sv2518/loopy/compare/cvec-minus-firedrake...sv2518:cvec-restructure-checks-minus-firedrake; how the new functions are called from PyOP2 you can find here https://github.com/OP2/PyOP2/compare/vectorisation...vectorisation-restructure-checks.

The only drawback is that I have to call realize_ilp by hand before vectorising. Maybe we would also need to realise the reductions and so forth by hand. Previously this happened by going through the standard preprocessing pipeline of loopy.

Let me know if this fits what you had in mind. At least this exposes more of what is happening to the user: checking if vectorisable, retagging and privatising.

sv2518 avatar Sep 01 '20 15:09 sv2518

Does your proposal imply that we would drop the fallback option of OpenMPSIMD pragmas completely?

I agree omp_simd tags are nice to have as a fallback. New proposal: have the class CVectorExtensionsTarget take an input for vec_fallback_tag (defaulted to unr, which is the current implementation).

In my understanding you want to move even more than the retagging from VectoriseTag to OpenMPSIMDTag to PyOP2?

Because this would lead to a complicated user script, I think there is merit in having a fallback tag (see above). So in PyOP2, if we just have n_batch: vec as the iname tag, this should be fine.

how the new functions are called from PyOP2 you can find here OP2/PyOP2@vectorisation...vectorisation-restructure-checks.

The interface is still a bit more complicated than I was imagining. I was hoping that on the PyOP2 end we do no more than splitting the iname -> privatizing -> tagging array axes, tagging inames.

So the change needed on the loopy end would be to generalize the vectorization end of codegen so that it does not always fall back to unrolling.

A simple test case would be something on the lines of:

import loopy as lp

knl = lp.make_kernel(
        "{[i]: 0<=i<4}",
        """
        <> x[i] = 1.0f
        <> y[i] = 2.0f
        <> z[i] = x[i] + y[i]
        out[i] = z[i]
        """, lang_version=(2018, 2),
        target=lp.CVectorExtensionsTarget(vec_fallback_tag='omp_simd'))

knl = lp.tag_inames(knl, 'i:vec')
knl = lp.tag_array_axes(knl, 'x, y, z', 'vec')

print(lp.generate_code_v2(knl).device_code())

which would generate the following code:

typedef float v4f __attribute__((vector_size(16)));

void loopy_kernel(float* __restrict__ out)
{
  v4f x;
  v4f y;
  v4f z;
  #pragma omp simd
  for (int i=0; i<4; i++)
  {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
  z = x + y; 
  #pragma omp simd
  for (int i=0; i<4; i++)
    out[i] = z[i];
}

kaushikcfd avatar Sep 04 '20 03:09 kaushikcfd