loopy
Vectorisation
I would like to merge TJ's work on the introduction of a C vector target in order to enable vectorisation. PyOP2 and Firedrake pass all tests with this branch. If possible I would like to merge this into master, since it should not depend on the callables structure introduced in the firedrake branch.
Hi Sophia,
Thanks for the PR. Could you please point out what additional features `c_vec` provides that aren't already present in the `vec` iname tags?
IIUC the advantage of the `c_vec` tag over `vec` is that it emits cleaner code by making use of the language extensions provided by GCC/Clang.
In my understanding `vec` iname tags are only available for the OpenCL/CUDA targets, am I mistaken? The `c_vec` tag allows the use of C vector extensions.
Could you explain how `c_vec` differs semantically from `vec`?
Off the cuff, rather than introduce a new tag, I think I would prefer realizing the existing `vec` tag the way you realize `c_vec` now in the target you're adding.
I had a look at where the code generation differs depending on the tags.
- Depending on the tag (`CVectoriseTag` or `VectorizeTag`), sequential or vectorised loop code is generated here: https://github.com/sv2518/loopy/blob/cda1d551c7b607915d9d6aaea9d4edef132d13a2/loopy/codegen/control.py#L123
- The access info to arrays differs between the dim_tags (`VectorArrayDimTag` or `CVectorArrayDimTag`) here: https://github.com/sv2518/loopy/blob/cda1d551c7b607915d9d6aaea9d4edef132d13a2/loopy/kernel/array.py#L1350
- And here https://github.com/sv2518/loopy/blob/cda1d551c7b607915d9d6aaea9d4edef132d13a2/loopy/kernel/array.py#L441 the code generation also differs between the dim_tags, but I don't quite understand yet what the function is doing.
I have asked @tj-sun to join the discussion, he is the owner of this code and can probably explain better why he decided to do it this way.
OIC, thanks for those links. In that case, I'm not sure how `c_vec` (as an array axis tag) is different from a normal (strided) array axis. And I don't really understand what `CVectorizeTag` (as an iname tag) does. Could you explain?
Hi,
It's been a while since I last looked at this, so I might end up saying something stupid. But the idea of `c_vec` is indeed an abstraction of the vector extensions in C, i.e. to support use cases such as:
typedef double double4 __attribute__ ((vector_size (32)));
double4 a, b, c;
// use as vector
c = a + b;
// but can also access individual elements
for (int i = 0; i < 4; i++)
c[i] = a[i] + b[i];
I'm not sure if we can achieve this by specialising `vec` for `CTarget`, but I seem to recall that I tried and ran into some problems with `vec`.
The iname tag was introduced mostly because there are some special cases where such vectors can't be used, e.g.
c = sin(b); // can't compile
c = 0.0; // can't compile
Let me know what you think.
Thanks @tj-sun for your help! I'm still of the mind to address the problems you encountered (and allow a unified "vocabulary" across targets) rather than introduce a target-specific widget, at least since I can't think of a first-principles reason that would prevent this. To that end, do you have a vague sense of what the issues were on either end?
In my understanding you want to keep the target; you only request using the same tag as for vectorisation on the other targets, is that right? I will go ahead and try to make this work. To me it looks like the new tags were only introduced to go into different code paths for vectorisation depending on the target. I think we should be able to use the `VectorizeTag` and make the code path simply conditional on the kernel target. So, for example, for the snippet controlling the generation of the loop code here https://github.com/sv2518/loopy/blob/cda1d551c7b607915d9d6aaea9d4edef132d13a2/loopy/codegen/control.py#L123, instead of the following
if filter_iname_tags_by_type(tags, (UnrollTag, UnrolledIlpTag)):
func = generate_unroll_loop
elif filter_iname_tags_by_type(tags, VectorizeTag):
func = generate_vectorize_loop
elif filter_iname_tags_by_type(tags, CVectorizeTag):
func = generate_sequential_loop_dim_code
we would do something like
if filter_iname_tags_by_type(tags, (UnrollTag, UnrolledIlpTag)):
func = generate_unroll_loop
elif filter_iname_tags_by_type(tags, VectorizeTag):
if isinstance(kernel.target, CVecTarget):
func = generate_sequential_loop_dim_code
else:
func = generate_vectorize_loop
That sounds like a reasonable approach.
I remember that one thing to watch out for is the codegen of indexing into a `c_vec` here:
https://github.com/sv2518/loopy/blob/cda1d551c7b607915d9d6aaea9d4edef132d13a2/loopy/target/c/codegen/expression.py#L962
The index is put into the subscripts, and it should just print as a 1D array: `a[0]`. I can't remember what happens when using `vec` in `CTarget`, but you probably need a slightly different mechanism, as `vec` has a separate index field?
Also, the iname tag `c_vec` is an `IlpBaseTag` and `vec` is a `HardwareConcurrentTag`, which seems more restrictive.
The codegen of the indexing there works on the access information to the array, if I see that right. I would simply condition on the target in `get_access_info` as well; I don't think that is a problem.
@kaushikcfd / @inducer can you explain the difference between `IlpBaseTag` and `HardwareConcurrentTag`? Do you think it would be a problem to switch from `IlpBaseTag` to `HardwareConcurrentTag` for the iname tag on the `CVectorTarget`?
Okay, done that. We have got a "unified vocabulary across targets" now by using the `VectorizeTag` also on the `CVecTarget`. Let me know what you think.
More feedback welcome.
An even simpler implementation is possible by handling the vector extensions similarly to `OpenCLTarget`'s support for vector types (like `floatn`, `doublen`). I have demonstrated it in the c_vecextensions_target branch.
I am curious to know: is there any feature in the implementation here that cannot be expressed through the implementation in the c_vecextensions_target branch?
Thanks.
Okay, I am currently trying to understand the difference between our branches. Can I just ask a few questions?
- The way the vector extensions are defined in your branch is `base_type __attribute__((vector_size(byte_count)))`, and you use that for every temporary declaration; we go over a typedef in the preamble: `typedef base_type vector_type __attribute__ ((vector_size(byte_count)));`
I think going over the typedef would be a bit neater, or is there any disadvantage to that?
Just a little question: you keep track of the new types of declarations, but why do you do it twice? You set it as a class attribute here already https://github.com/inducer/loopy/blob/f81f3cd373f46442554b6ed96aabe08355fec572/loopy/target/c_vector_extensions.py#L67, but then you also save it into a dict that is also saved in the class: https://github.com/inducer/loopy/blob/f81f3cd373f46442554b6ed96aabe08355fec572/loopy/target/c_vector_extensions.py#L69
- Your `VectorExtensionASTBuilder` does not have the functions `emit_sequential_loop` and `generate_temporary_decl` as ours does; the former is used for the fallback option of SIMD pragmas, the latter is used to declare global constants.
The former we definitely need, in case expressions can't be vectorised with the help of the vector extensions. That's one of the big differences between our two branches.
And I think, instead of the latter, you made use of `add_vector_access` in the AST builder. The generated code for a declaration is different, however; for a 10 by 1 vector the code in your branch produces
for (int32_t i = 0; i <= 9; ++i)
{
temp1[0] = x[4 * i];
temp1[1] = x[1 + 4 * i];
temp1[2] = x[2 + 4 * i];
temp1[3] = x[3 + 4 * i];
}
In our branch this would look like
for (int i3 = 0; i3 <= 9; ++i3)
temp1[i3] = vector_value;
where vector_value is of size 4, so that an access to the second element of, say, the third batch would be `t2[1][2]`. So our storage format is a) transposed to yours, plus in your case the access to the array would be expressed in C as `t2[2+4*1]`, is that right?
I think this difference in storage format is also why we need the `ExpressionToCVecExpressionMapper` and `CVecExpressionToCodeMapper` and you do not.
Would you be able to change the storage format and add the fallback OpenMP pragma option in your branch without ending up doing what we did?
- Another little question: what are your arguments against using a `DtypeRegistryWrapper` for the vector types? Surely it is nicer to encapsulate this in a specialised `DtypeRegistryWrapper` than to just call `_register_vector_types`?
- Another major difference is that rather than generating the sequential loop dim code, you go over `vectorise_loop` here https://github.com/sv2518/loopy/blob/e2684a5f5062fb5f7bfb5c5f092ceb3092c98e5b/loopy/codegen/control.py#L125. I don't quite understand yet in which ways the code generation of these two differs; maybe someone else can comment on this, otherwise I will have a closer look.
Besides the facts that your branch does not contain the fallback option of OpenMP SIMD pragmas and that the storage format differs, another big difference is the preprocessing work we do in `realize_c_vec` before generating the code. Some expressions need transformations before the compiler wants to vectorise them. One example: if you have a constant right-hand side, we need to add a vector-valued 0 to the expression.
Another thing is that how much data of one type you can fit into one vector depends on the vector extensions available on your machine. We pass this information through from PyOP2; I think you don't consider this in your branch.
Thanks for taking a look into the alternative implementation!
I think going over the typedef would be a bit neater, or is there any disadvantage to that?
True. That can be implemented by overriding the preamble generator of the target's AST builder and maintaining the `seen_dtypes`.
but then you also save it into a dict that is also saved in the class..
Yep, that seems unnecessary and should be removed. I copied it over from loopy/target/opencl.py and didn't notice it.
Your VectorExtensionASTBuilder does not have the functions emit_sequential_loop and generate_temporary_decl as ours does, former is used for the fallback option of simd pragmas, latter is used to declare global constants.
In such cases, I would prefer if the user realizes that the instruction cannot be vectorized and tags the iname with the fallback tag. This way the code generation process is more transparent to the user.
The generated code for a declaration is different however, for a 10 by 1 vector the code in your branch produces
- I think the indexing you provided is incorrect; in the first kernel it should be `temp1[4*i3+1]`.
- The assembly of both kernels would be similar, i.e. both would have `vmov` operations.
How much data of one type you can fit into one vector depends on the vector extensions available on your machine. We pass this information through from PyOP2, I think you don't consider this in your branch.
This isn't needed, as that's inferred from the iname bounds of the loop being vectorized. This freedom is also allowed in the underlying target, as one can have types like `double __attribute__((vector_size(16)))` on processors with AVX support.
Some expressions need some transformations, before the compiler wants to vectorise them. One example is, if you have a constant right hand side, we need to add a vector valued 0 to the expression.
In this case the loop responsible for zero-ing the array should be tagged with `unr` and not `vec`, and the compiler vectorizes it. I have compared the 2 approaches here.
Would you be able to change the storage format and add the fallback OpenMP pragma option in your branch without ending up doing what we did?
I think before answering this, we need to figure out whether we intend to use `pragma omp simd` implicitly as the fallback for `vec` tags. I would lean towards "no", but others can voice their opinions.
True. That can be implemented by overriding the preamble generator of the target's AST builder and maintaining the seen_dtypes.
Okay, thanks! I will try to rewrite our preamble generation with consideration of your fix.
This isn't needed, as that's inferred from the iname bounds of the loop being vectorized. This freedom is also allowed in the underlying target as one can have types of double __attribute__((vector_size(16))) on processors with AVX support.
Aha, okay! I will try to clean this up a little then.
In such cases, I would prefer if the user realizes that the instruction cannot be vectorized and tags the iname with the fallback tag. This way the code generation process is more transparent to the user.
Okay, since PyOP2 is basically the user from your perspective, we would need to collect the inames that need to be retagged and provide feedback about it. How would you like to do that? `loopy.generate_code_v2` is our trigger for the loopy process. Maybe I can collect all the inames here https://github.com/sv2518/loopy/blob/e2684a5f5062fb5f7bfb5c5f092ceb3092c98e5b/loopy/preprocess.py#L2195 and then throw an error, so that in PyOP2 we can say
try:
    loopy.generate_code_v2(...)
except CVecError:
    retag_from_vec_to_simd(kernel, to_be_retagged_inames)
    loopy.generate_code_v2(...)
I don't think this is ideal, however; it means we need to run through your pipeline twice, at least partially.
Maybe we can port the check whether the expressions are vectorisable with the vector extensions or SIMD pragmas into a function of its own, say `def check_vectorisability(...)`, and invoke that before we call `loopy.generate_code_v2(...)`.
Are there other options?
I think the indexing you provided is incorrect, in the first kernel it should be temp1[4*i3+1]
I actually copied the first code snippet from the kernel printed on your branch. The full kernel produced by your branch is
#include <stdint.h>
void loopy_kernel(double const *__restrict__ x, double *__restrict__ y)
{
double __attribute__((vector_size(32))) temp1;
double __attribute__((vector_size(32))) temp2;
for (int32_t i = 0; i <= 9; ++i)
{
temp1[0] = x[4 * i];
temp1[1] = x[1 + 4 * i];
temp1[2] = x[2 + 4 * i];
temp1[3] = x[3 + 4 * i];
temp2 = 2.0 * temp1 + 1.0;
y[4 * i] = temp2[0];
y[1 + 4 * i] = temp2[1];
y[2 + 4 * i] = temp2[2];
y[3 + 4 * i] = temp2[3];
}
}
That looks to me like your storage format is `temp1[batch_number+batch_size*element_number]`; our storage format is `temp1[element_number][batch_number]`, so we don't use striding in this case (we would use striding in the first dimension if the considered data were in matrix form). I think this is important because the code generation in `map_subscript` differs, see here https://github.com/sv2518/loopy/blob/e2684a5f5062fb5f7bfb5c5f092ceb3092c98e5b/loopy/target/c/codegen/expression.py#L962 and here https://github.com/sv2518/loopy/blob/e2684a5f5062fb5f7bfb5c5f092ceb3092c98e5b/loopy/target/c/codegen/expression.py#L1009
In this case the loop responsible for zero-ing the array should be tagged with unr and not vec and the compiler vectorizes it. I have compared the 2 approaches here.
I had a look at that and it is slightly different from what I meant to say. I meant: if you have an expression `vector_valued_lhs = vector_valued_constant;`, then in the preprocessing we transform that to `vector_valued_lhs = vector_valued_constant + vector_valued_zero;`. Maybe it's the same story; however, do you think `vector_valued_lhs = vector_valued_constant;` would be vectorised by the compiler if retagged to `unr`? @tj-sun did you try this at the time you wrote the code?
Thanks for your reviews so far! I hope we can sort this out rather quickly, so I can continue with my own work on batched BLAS. Let me know what you think about the ideas I had in regards to the OpenMP SIMD pragmas. I will also chat to the Firedrake people about moving the retagging a level up, to be controlled by PyOP2.
(about the implementation in PyOP2)
In PyOP2 the steps that would be needed (not very different from what's already implemented):
- Split `n` into `n_batch` depending on the underlying arch.
- Privatize the temporaries in the `n_batch` loop.
- Tag the privatized array axes of the temporaries with the `vec` dim tag.
- Tag `n_batch` with the `vec` iname tag.
In this case loopy would emit a warning saying that the gather/scatter instructions of the PyOP2 kernel weren't vectorized and would unroll `n_batch` for those instructions. (@tj-sun: Could you please confirm if this would capture all the desired vectorization opportunities?)
(about the discrepancy in the storage format)
`temp` is not a multi-dimensional array in the provided example. Trying an example with a multi-dimensional array realized as a vector type:
knl = lp.make_kernel(
"{[i, j1, j2, j3, k]: 0<=i<10 and 0<=j1,j2,j3<4 and 0<=k<2}",
"""
<> temp1[k, j1] = x[i, k, j1]
<> temp2[k, j2] = 2*temp1[k, j2] + 1 {inames=i:j2}
y[i, k, j3] = temp2[k, j3]
""",
[lp.GlobalArg('x, y', shape=lp.auto, dtype=float)],
seq_dependencies=True,
lang_version=(2018, 2),
target=lp.CVectorExtensionsTarget())
knl = lp.tag_inames(knl, 'j2:vec, j1:ilp, j3:ilp')
knl = lp.tag_array_axes(knl, 'temp1,temp2', 'c,vec')
Generated code:
void loopy_kernel(double const *__restrict__ x, double *__restrict__ y)
{
double __attribute__((vector_size(32))) temp1[2];
double __attribute__((vector_size(32))) temp2[2];
for (int k = 0; k <= 1; ++k)
for (int i = 0; i <= 9; ++i)
{
(temp1[k])[0] = x[8 * i + 4 * k];
(temp1[k])[1] = x[1 + 8 * i + 4 * k];
(temp1[k])[2] = x[2 + 8 * i + 4 * k];
(temp1[k])[3] = x[3 + 8 * i + 4 * k];
temp2[k] = 2.0 * temp1[k] + 1.0;
y[8 * i + 4 * k] = (temp2[k])[0];
y[1 + 8 * i + 4 * k] = (temp2[k])[1];
y[2 + 8 * i + 4 * k] = (temp2[k])[2];
y[3 + 8 * i + 4 * k] = (temp2[k])[3];
}
}
In PyOP2 the steps that would be needed (not very different from what's already implemented): split `n` into `n_batch` depending on the underlying arch; privatize the temporaries in the `n_batch` loop; tag the privatized array axes of the temporaries with the `vec` dim tag; tag `n_batch` with the `vec` iname tag. In this case loopy would emit a warning saying that the gather/scatter instructions of the PyOP2 kernel weren't vectorized and would unroll `n_batch` for those instructions.
I think I don't understand this completely. I have two questions:
- In my understanding you want to move even more than the retagging from `VectoriseTag` to `OpenMPSIMDTag` into PyOP2? The kernel passed to loopy from PyOP2 is actually a program. Going through your pipeline has the advantage that we end up in `preprocess_single_kernel` and can do our transformations there (on a single kernel), rather than walking through the kernels in the program ourselves.
- Does your proposal imply that we would drop the fallback option of OpenMP SIMD pragmas completely?
Okay, maybe I have got something working that fits what you said. I outsourced the checks whether something is vectorizable with extensions, OpenMP pragmas or unrolling into `check_cvec_vectorizability(...)`. It gives back information about which instruction and iname pairs belong to which of the three categories. This information is then passed to `cvec_retag_and_privatize`, which is responsible for all the substitutions. The corresponding loopy branch is here https://github.com/sv2518/loopy/compare/cvec-minus-firedrake...sv2518:cvec-restructure-checks-minus-firedrake; how the new functions are called from PyOP2 you can find here https://github.com/OP2/PyOP2/compare/vectorisation...vectorisation-restructure-checks.
The only drawback is that I have to call `realize_ilp` by hand before vectorising. Maybe we would also need to realise the reductions and so forth by hand. Previously this was happening by going through the standard preprocessing pipeline of loopy.
Let me know if this fits what you had in mind. At least this exposes more of what is happening to the user: checking if vectorisable, retagging and privatising.
Does your proposal imply that we would drop the fallback option of OpenMPSIMD pragmas completely?
I agree `omp_simd` tags are nice to have as a fallback. New proposal: have the class `CVectorExtensionsTarget` take an input for the 'vec_fallback_tag' (defaulted to `unr`, which is the current implementation).
In my understanding you want to move even more than the retagging from VectoriseTag to OpenMPSIMDTag to PyOP2?
Because this would lead to a complicated user script, I think there is merit in having a fallback tag (see above). So, in PyOP2, if we just have `n_batch: vec` as the iname tag, this should be fine.
how the new functions are called from PyOP2 you can find here https://github.com/OP2/PyOP2/compare/vectorisation...vectorisation-restructure-checks.
The interface is still a bit more complicated than I was imagining. I was hoping that on the PyOP2 end we do no more than splitting the iname -> privatizing -> tagging array axes, tagging inames.
So the change that would be needed on the loopy end would be to generalize the vectorization end of codegen so that it does not always fall back to unrolling.
A simple test case would be something on the lines of:
import loopy as lp
knl = lp.make_kernel(
"{[i]: 0<=i<4}",
"""
<> x[i] = 1.0f
<> y[i] = 2.0f
<> z[i] = x[i] + y[i]
out[i] = z[i]
""", lang_version=(2018, 2),
target=lp.CVectorExtensionsTarget(vec_fallback_tag='omp_simd'))
knl = lp.tag_inames(knl, 'i:vec')
knl = lp.tag_array_axes(knl, 'x, y, z', 'vec')
print(lp.generate_code_v2(knl).device_code())
which would generate the following code:
typedef float v4f __attribute__((vector_size(16)));
void loopy_kernel(float* __restrict__ out)
{
v4f x;
v4f y;
v4f z;
#pragma omp simd
for (int i=0; i<4; i++)
{
x[i] = 1.0f;
y[i] = 2.0f;
}
z = x + y;
#pragma omp simd
for (int i=0; i<4; i++)
out[i] = z[i];
}