Polygeist
[RFC] C/C++ compiler directives for MLIR dialect lowering
Recently I have been wondering whether it would be nice for Polygeist to have the following feature: in the source C/C++ code, we could provide custom directives that instruct Polygeist to lower a given C/C++ function to a specific MLIR function.
A motivating example
As an example, suppose we have a program that calls a Convolution function (yes, it is a CNN):
void ConvNet(float *X, float *W1, float *W2, float *Y1, float *Y2) {
Convolution(X, W1, Y1);
Convolution(Y1, W2, Y2);
}
Instead of providing an implementation of Convolution in C/C++, could we just declare its interface and point it to @linalg.conv in the following form:
#pragma lower_to("@linalg.conv")
extern void Convolution(float *filters, float *input, float *output);
Given this information, Polygeist can smartly lower the C/C++ code into:
func @ConvNet(%X: memref<?xf32>, %W1: memref<?xf32>, %W2: memref<?xf32>, %Y1: memref<?xf32>, %Y2: memref<?xf32>) {
  call @linalg.conv(%X, %W1, %Y1) : (memref<?xf32>, memref<?xf32>, memref<?xf32>) -> ()
  call @linalg.conv(%Y1, %W2, %Y2) : (memref<?xf32>, memref<?xf32>, memref<?xf32>) -> ()
  return
}
Why do we need this feature?
The lowering path that Polygeist currently supports mainly targets scf + std, with or without raising to affine.
This is sufficient as long as all the functions of interest (i.e., excluding things like printf) are fully implemented, e.g., the kernels in Polybench.
If an implementation is not available, all we can do is declare the callee as a private function.
But what if we know exactly what that unimplemented function should be lowered to in MLIR?
The MLIR counterpart of that function could be in one of the "official" dialects, e.g., linalg, or others not very official, e.g., mhlo, or even some DSL you invent in MLIR.
That MLIR counterpart might already have its well-optimized lowering mechanism implemented.
In that case, instead of writing a C/C++ implementation of that function and lowering it to affine/scf/std, it can be much better to lower the call directly to its MLIR counterpart: we save both the C/C++ implementation time and the optimization effort.
How to implement?
Suppose the compiler directive is #pragma lower_to("<MLIR function symbol>"). There are two things we need to do to implement the whole feature:
- Build the mapping from C/C++ function to the symbol of its MLIR counterpart based on the directive;
- Create the MLIR function call.
The first part seems straightforward, while the second is not.
The biggest challenge (AFAIK) is handling the differences between operand types in C and MLIR.
Suppose an operand x has type Tc in the source C/C++, that Tc can be mapped to Tm1 in MLIR, and that the desired argument type in the target MLIR function is Tm2.
There are three scenarios:
- If Tc has only one available mapping Tm1, and Tm1 is equivalent (or can be safely cast) to Tm2, then things should be fine; we just need to insert typecast operations on demand.
- If Tc has multiple valid mappings, and some of them are equivalent (or can be safely cast) to Tm2, then we should probably rank these choices and select the best option, which is not a trivial task.
- If no mapping of Tc can reach Tm2, then the type check should fail.
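For the first scenario, the inserted cast could look like the following MLIR sketch (the function names are made up for illustration; memref.cast is the upstream op for such compatible conversions, though its exact spelling has varied across MLIR versions):

```mlir
// The C argument maps naturally to memref<3x3xf32> (Tm1), while the
// target MLIR function expects memref<?x?xf32> (Tm2); a cast bridges
// the two, since a static shape can be safely erased to a dynamic one.
func @caller(%arg : memref<3x3xf32>) {
  %cast = memref.cast %arg : memref<3x3xf32> to memref<?x?xf32>
  call @target(%cast) : (memref<?x?xf32>) -> ()
  return
}
func private @target(memref<?x?xf32>)
```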
Overall I think there is a viable solution here: these challenges can be addressed given some time for consideration and implementation.
Summary
I'm thinking of enabling Polygeist to process special compiler directives that map C/C++ function interfaces to specific MLIR functions.
I'm going to give it a go in the following weeks. Please let me know if you have better solutions/ideas!
I would like to start a discussion on how to develop this lower_to pragma properly. I did some experiments in the directions highlighted by this RFC (see #57), applying the pragma to generate operations from the linalg dialect. To do so, I had to extend the pragma with "input" and "output" fields, which represent the input and output operands of the linalg operation. We need these fields to explicitly indicate inputs and outputs, as we cannot infer this simply by looking at the function operands. Here is an example of generating a named copy op from the linalg dialect.
#pragma lower_to(copy_op, "linalg.copy") "input"(a), "output"(b)
void copy_op(int b[3][3], int a[3][3]) {
for (int i = 0; i < 3; i++)
for (int j = 0; j < 3; j++)
b[i][j] = a[i][j];
}
int main() {
int a[3][3];
int b[3][3];
// CHECK: linalg.copy({{.*}}, {{.*}}) : memref<3x3xi32>, memref<3x3xi32>
copy_op(a, b);
return 0;
}
After this extension it is fairly easy to emit linalg named ops, but I hit a problem when generating linalg.generic. Specifically, a linalg.generic op requires an indexing map for each input and output view, but we do not have them at this point. Do you have any suggestions on how to emit a linalg.generic op? Or should we restrict ourselves to linalg named ops only?
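For concreteness, here is a sketch of what a linalg.generic equivalent of copy_op would have to contain (written in one recent upstream MLIR syntax, which may differ from the version Polygeist targets). It makes the two missing ingredients explicit: an indexing map per operand, and a body region defining the computation:

```mlir
#id = affine_map<(d0, d1) -> (d0, d1)>

// Hand-written linalg.generic equivalent of copy_op: the pragma as
// extended so far carries neither the indexing_maps attribute nor
// the region below.
func @copy_generic(%a : memref<3x3xi32>, %b : memref<3x3xi32>) {
  linalg.generic {indexing_maps = [#id, #id],
                  iterator_types = ["parallel", "parallel"]}
      ins(%a : memref<3x3xi32>)
      outs(%b : memref<3x3xi32>) {
  ^bb0(%in : i32, %out : i32):
    linalg.yield %in : i32
  }
  return
}
```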
I wonder if perhaps the right place to do this is not so much on the call declaration, but rather a special pragma that results in emitting the op itself at that point.
Maybe something like
int a[3][3];
int b[3][3];
#pragma op linalg.generic "input"(a) "output"(b)
This way we could use various runtime parameters as part of the op?
@wsmoses Hi Billy, I am not sure what you mean by "runtime parameters". The current problems I see when emitting linalg.generic ops are: 1) linalg.generic requires indexing maps to describe the mapping between loops and input/output buffers; how can we get these if we do not bind the pragma to a given function? And even after binding to a function, how can we recover the access maps at this point in our codegen? 2) linalg.generic requires you to define the computation; again, this means that we need to bind the pragma to some function or nested loop. From the example you made above, I am not sure how we can get this information.