[WIP] [spirv] Add Vulkan runtime.
This patch addresses https://github.com/tensorflow/mlir/issues/60. It implements an initial version of a Vulkan runtime to test spirv::ModuleOp, and creates and runs a Vulkan compute pipeline.
Provides one unit test that multiplies two spv.arrays of float type.
Requirements:
- A Vulkan-capable device with drivers installed.
- The Vulkan SDK installed and the VULKAN_SDK environment variable set.
How to build:
- Pass -DMLIR_VULKAN_RUNNER_ENABLED as a CMake variable.
Note:
- Tested only on GNU/Linux with the official Vulkan SDK provided by https://vulkan.lunarg.com/doc/sdk/latest/linux/getting_started.html
@antiagainst @MaheshRavishankar can you please take a look? Thanks!
@antiagainst thanks a lot for the review! I will update the patch according to the comments.
@antiagainst I've updated the patch according to your comments and have some thoughts about the error reporting machinery in particular and about the vulkan-runner in general.
The following text may repeat things that have already been agreed on, sorry about that, but I want to make sure I understand everything correctly. Can you please comment on the following, and correct me if I'm wrong.
- I was looking at the current implementation of the other runtime libraries inside MLIR, and they use llvm::errs(), for example the CUDA runtime wrappers: https://github.com/tensorflow/mlir/blob/master/tools/mlir-cuda-runner/cuda-runtime-wrappers.cpp#L34
- It could be better for the Vulkan runtime to be independent of the MLIR context; for example, the full pipeline for the vulkan-runner could look like this:
2.1. Compile-time passes:
a. A pass which serializes the SPIR-V module into binary form and inserts it as an attribute, similar to GpuToCubinPass (a minimal sketch follows this list):
function.setAttr(kCubinAnnotation,
                 builder.getStringAttr({cubin->data(), cubin->size()}));
b. A pass which creates a global constant containing the SPIR-V binary, similar to GpuGenerateCubinAccessorsPass:
Value *startPtr = LLVM::createGlobalString(
    loc, builder, StringRef(nameBuffer), blob.getValue(), llvmDialect);
builder.create<LLVM::ReturnOp>(loc, startPtr);
c. And the final pass, which converts the launch call into calls to the Vulkan runtime wrappers, inserts calls to register buffers with the Vulkan runtime, and so on.
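To make pass (a) concrete, a rough sketch could look like the following. It assumes the existing spirv::serialize entry point (signature assumed to be LogicalResult(spirv::ModuleOp, SmallVectorImpl<uint32_t>&)); the attribute name and the surrounding pass boilerplate are made up for illustration, includes omitted.
// Sketch only: serialize a spv.module and attach the blob as a string
// attribute on the kernel function, mirroring what GpuKernelToCubinPass does
// for cubins. "spirv_blob" is a hypothetical attribute name.
static constexpr const char *kSPIRVBlobAttrName = "spirv_blob";

static LogicalResult attachSPIRVBlob(spirv::ModuleOp spvModule,
                                     FuncOp kernelFunc, OpBuilder &builder) {
  SmallVector<uint32_t, 0> binary;
  if (failed(spirv::serialize(spvModule, binary)))
    return failure();
  // Reinterpret the 32-bit word stream as raw bytes so it fits in a StringAttr.
  StringRef blob(reinterpret_cast<const char *>(binary.data()),
                 binary.size() * sizeof(uint32_t));
  kernelFunc.setAttr(kSPIRVBlobAttrName, builder.getStringAttr(blob));
  return success();
}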
2.2. The runtime part consists of two pieces: Vulkan runtime wrappers over the actual runtime, and the actual Vulkan runtime, which manages memory, descriptors, layouts, the pipeline, and so on. In this case the MLIR module could look like this:
module {
  func @main() {
    // ...
    call @vkPopulateBuffer(%buff : memref<?xf32>, %descriptorSet : i32, %descriptorBinding : i32, %storageClass : i32) ...
    // ...
    call @vkLaunchShader(...) ...
    // ...
  }
  // extern @vkPopulateBuffer
  // extern @vkLaunchShader
}
Vulkan runtime wrappers could look like this:
extern "C" void vkPopulateBuffer(const memref_t buff, int32_t descriptorSet,
int32_t decsriptorBinding, int32_t stroageClass) {
VulkanRuntimeManager::instance()->registerBuffer(
vulkanHostMemoryBuffer(buff.values, buff.length * sizeof(float)),
decsriptorSet, descriptorBinding, storageClass);
}
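For context, the plain-C types used in the wrapper above could be as simple as the following sketch. The field names are illustrative; a real memref descriptor produced by the LLVM lowering also carries aligned pointer, offset, sizes, and strides, so it is richer than this.
#include <cstdint>

// Illustrative only: a flat view of a 1-D float memref crossing the C ABI
// boundary, and the host-side buffer view the Vulkan runtime keeps around.
struct memref_t {
  float *values;  // pointer to the data
  int64_t length; // number of elements
};

struct VulkanHostMemoryBuffer {
  void *ptr;            // host memory to copy to/from device-visible memory
  uint32_t sizeInBytes; // size of the region in bytes
};

inline VulkanHostMemoryBuffer vulkanHostMemoryBuffer(void *ptr,
                                                     uint32_t sizeInBytes) {
  return VulkanHostMemoryBuffer{ptr, sizeInBytes};
}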
VulkanRuntimeManager could look like this:
class VulkanRuntimeManager {
public:
  static VulkanRuntimeManager *instance() {
    // Intentionally leaked to avoid destruction-order issues at shutdown.
    static VulkanRuntimeManager *mgr = new VulkanRuntimeManager;
    return mgr;
  }
  void registerBuffer(VulkanHostMemoryBuffer buffer, int32_t set,
                      int32_t binding, int32_t storageClass) {
    std::lock_guard<std::mutex> lock(m);
    runtime.registerBuffer(...);
  }
  void launchShader(...);

private:
  std::mutex m;
  VulkanRuntime runtime;
};
In this case we can first register all the information the Vulkan runtime needs, such as resources (set, binding), the entry point, and the number of work groups; then create device memory buffers, descriptor set layouts, and the pipeline layout; bind them to the compute pipeline; create a command buffer; and finally submit the command buffer to the work queue. What do you think? Thanks.
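As a concrete sketch of the last two steps (recording the command buffer and submitting it to the queue), using the raw Vulkan API with no error checking, and assuming the pipeline, layout, and descriptor set were created as described above:
#include <vulkan/vulkan.h>

// Sketch: record one compute dispatch and submit it synchronously.
void submitCompute(VkCommandBuffer commandBuffer, VkPipeline pipeline,
                   VkPipelineLayout pipelineLayout,
                   VkDescriptorSet descriptorSet, VkQueue queue,
                   uint32_t groupCountX, uint32_t groupCountY,
                   uint32_t groupCountZ) {
  VkCommandBufferBeginInfo beginInfo = {};
  beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
  vkBeginCommandBuffer(commandBuffer, &beginInfo);

  vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
  vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE,
                          pipelineLayout, /*firstSet=*/0,
                          /*descriptorSetCount=*/1, &descriptorSet,
                          /*dynamicOffsetCount=*/0, nullptr);
  vkCmdDispatch(commandBuffer, groupCountX, groupCountY, groupCountZ);
  vkEndCommandBuffer(commandBuffer);

  VkSubmitInfo submitInfo = {};
  submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  submitInfo.commandBufferCount = 1;
  submitInfo.pCommandBuffers = &commandBuffer;
  vkQueueSubmit(queue, /*submitCount=*/1, &submitInfo, VK_NULL_HANDLE);
  vkQueueWaitIdle(queue);
}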
Re: https://github.com/tensorflow/mlir/pull/118#issuecomment-529228871
- Okay, using llvm::errs() SGTM. :)
- What you said is reasonable to me!
Again, sorry for the delay on reviewing! There are quite a few interesting things that happened in the meantime that relate to this. Our prototype on the Vulkan runtime side, IREE, was open sourced. (I see you've already noticed that because you posted questions there. ;-P) It is a reflection of how we actually want to approach this in a more Vulkan-native way.
Compared to CUDA, which provides more developer-friendly and middle-level host API abstractions, Vulkan's host API is more low-level and much more verbose. We think that by modelling it in IR form (not literally but selectively, on core features), we can gain the benefits of running compiler optimizations on it. It is not fully proven yet but looks quite promising thus far.
We are also thinking about how to structure MLIR core and IREE regarding the components. The SPIR-V dialect is in MLIR core right now, but the lowering from the high-level dialect used by IREE is not; it's inside IREE. That is partially because we don't have a proper modelling of the Vulkan host side in MLIR core. Here, different from CUDA again, a simple gpu.launch op that connects the host and device code does not really work for Vulkan (I mean, yes, we can still lower it to Vulkan, but it's not gaining all the expected benefits from Vulkan performance-wise). So it would be nice to have proper modelling of Vulkan in core so we can have the lowering in core and potentially share it with other code paths.
Just wanted to point out recent developments to keep you informed of where we are going; what you have here is certainly very valuable for testing and running in core and as building blocks for future Vulkan modelling. Thanks for the contribution again! :)
@kiszk @joker-eph @antiagainst @MaheshRavishankar thanks for the review! Actually, I was thinking that IREE covers this PR as well and that this PR is not needed anymore, but if it is still relevant for testing the lowering, that sounds great to me!
IREE is an independent project, and I am not aware of any plan to integrate IREE within MLIR, so having an end-to-end codegen story and testing capability still seems important to me.
@joker-eph @antiagainst
having an end-to-end codegen story and testing capability still seems important to me.
sounds great to me!
If it's ok, on the next iteration I'll rebase the current patch on trunk, update it according to the comments, delete the C++ unit tests, and leave this PR under WIP. After that I can start working on a separate PR, which will implement a pass that:
- Translates the module into a spv.module.
- Serializes the spv.module into binary form.
- Sets the serialized module as an attribute. This is similar to what GpuKernelToCubinPass does: https://github.com/tensorflow/mlir/blob/master/lib/Conversion/GPUToCUDA/ConvertKernelFuncToCubin.cpp
It seems to me that implementing a fully working mlir-vulkan-runner also requires one more PR to cover functionality similar to https://github.com/tensorflow/mlir/blob/master/lib/Conversion/GPUToCUDA/ConvertLaunchFuncToCudaCalls.cpp and to update the runtime part to be consistent with the new passes and mlir-vulkan-runner. I can cover that in future PRs. What do you think? If you see a better plan, please correct me. Thanks!
+1 to what @joker-eph said in https://github.com/tensorflow/mlir/pull/118#issuecomment-557054792.
@denis0x0D: what you said SGTM! Just wanted to let you know, we are working on enabling the lowering flow from the Linalg dialect to SPIR-V right now. I think that has a higher priority. We'll create issues for the tasks soon and would really appreciate it if you could help there first. :) Sorry for keeping this thread open for an even longer time, but since this is entirely new code, I guess it won't be much of a problem to rebase against the master branch later. Thanks!
@antiagainst
we are working on enabling the lowering flow from Linalg dialect to SPIR-V right now. I think that has a higher priority. We'll create issues for tasks soon and would really appreciate it if you can help there first.
Sounds great, I would like to help with it if possible! Thanks!
Just catching up on all the discussion here now. Apologies for not paying more attention here.
As everyone agrees here, having an mlir-vulkan-runner in MLIR core is extremely useful for testing. So thanks again @denis0x0D for taking this up. Just to provide more context, as @antiagainst mentioned, we have been trying to get all the pieces in place to convert from Linalg ops to the GPU dialect to the SPIR-V dialect. Some of the changes w.r.t. these will land soon (in a couple of days). I have also been making some fairly significant changes to the SPIR-V lowering infrastructure to make it easier to build upon (these should also land in a couple of days). While these changes don't directly affect the development of a vulkan-runner (the runner should only be concerned with the SPIR-V binary generated by the dialect), they do motivate some requirements on the vulkan-runner, summarized below.
- As structured right now, going to the GPU dialect will result in two sub-modules within the main module:
module attributes {gpu.container_module} {
  // Host side code which will contain gpu.launch_func op
  module @<some_name> attributes {gpu.kernel_module} {
    // Kernel functions, i.e., FuncOps with the gpu.kernel attribute
  }
}
The functions in the gpu.kernel_module with the gpu.kernel attribute are targeted for SPIR-V lowering. The lowering will create a new spv.module within the gpu.kernel_module:
module attributes {gpu.container_module} {
  // Host side code which will contain gpu.launch_func op
  module @<some_name> attributes {gpu.kernel_module} {
    spv.module {
      ...
    }
    // Kernel functions, i.e., FuncOps with the gpu.kernel attribute
  }
}
So it would be useful if the vulkan runner can handle such a module. Given this, I have some follow-up below on a previous comment:
@joker-eph @antiagainst
having an end-to-end codegen story and testing capability still seems important to me.
sounds great to me!
If it's ok, on the next iteration I'll rebase the current patch on trunk, update it according to the comments, delete the C++ unit tests, and leave this PR under WIP. After that I can start working on a separate PR, which will implement a pass that:
- Translates the module into a spv.module.
I am not sure we can translate a module into a spv.module directly, since the module contains both host- and device-side components. Do you mean extracting the spv.module from within a module and running it?
- Serializes the spv.module into binary form.
- Sets the serialized module as an attribute. This is similar to what GpuKernelToCubinPass does: https://github.com/tensorflow/mlir/blob/master/lib/Conversion/GPUToCUDA/ConvertKernelFuncToCubin.cpp
It seems to me that implementing a fully working mlir-vulkan-runner also requires one more PR to cover functionality similar to https://github.com/tensorflow/mlir/blob/master/lib/Conversion/GPUToCUDA/ConvertLaunchFuncToCudaCalls.cpp and to update the runtime part to be consistent with the new passes and mlir-vulkan-runner. I can cover that in future PRs. What do you think? If you see a better plan, please correct me. Thanks!
On Sat, Nov 23, 2019 at 11:45 PM MaheshRavishankar wrote:
- As structured right now, going to GPU dialect will result in two sub-modules within the main module
module {
  module attributes {gpu.container_module} {
    // Host side code which will contain gpu.launch_func op
  }
  module @fmul_kernel attributes {gpu.kernel_module} {
    // Kernel functions, i.e., FuncOps with the gpu.kernel attribute
  }
}
As far as I can tell, at the moment the host code is in the module enclosing the kernel module; they aren't siblings: https://github.com/tensorflow/mlir/blob/master/g3doc/Dialects/GPU.md#gpulaunch_func
Did I miss anything?
The functions in the gpu.kernel_module with the gpu.kernel attribute are targeted for SPIR-V lowering. The lowering will create a new spv.module within the gpu.kernel_module
module {
  module attributes {gpu.container_module} {
    // Host side code which will contain gpu.launch_func op
  }
  module @fmul_kernel attributes {gpu.kernel_module} {
    spv.module {
      ...
    }
    // Kernel functions, i.e., FuncOps with the gpu.kernel attribute
  }
}
Why is the spv.module nested inside the kernel_module instead of lowering the kernel_module into a spv.module? What do the "FuncOps with the gpu.kernel attribute" become in your representation above?
(Is there an example I could follow in the repo, maybe?)
Thanks,
-- Mehdi
@MaheshRavishankar thanks for the feedback,
I am not sure we can translate a module into spv.module directly since the module contains both host and device side components. Do you mean extracting spv.module from within a module and running them?
I was wrong about translating to spv.module inside this pass, sorry about that; I lost the context of this task :)
The GPU -> SPIR-V lowering already exists and should be added to the pass manager in mlir-vulkan-runner.
So to enable mlir-vulkan-runner we need:
- A pass which runs right after "-convert-gpu-to-spirv": it consumes a module with the {gpu.container_module} attribute, serializes the spv.module into binary form using the existing spirv::Serializer, and attaches the serialized spv.module as an attribute, similar to this: https://github.com/tensorflow/mlir/blob/master/lib/Conversion/GPUToCUDA/ConvertKernelFuncToCubin.cpp#L85
- A pass which instruments the host part, the part which contains gpu.launch_func, with calls to the runtime wrappers, similar to this (the API should be discussed):
module {
  func @main() {
    // ...
    call @vkPopulateBuffer(%buff : memref<?xf32>, %descriptorSet : i32, %descriptorBinding : i32, %storageClass : i32) ...
    // ...
    call @vkLaunchShader(...) ...
    // ...
  }
  // extern @vkPopulateBuffer
  // extern @vkLaunchShader
}
The host part is lowered to LLVM IR (the llvm dialect); the serialized module becomes an llvm.global https://github.com/tensorflow/mlir/blob/master/lib/Conversion/GPUToCUDA/ConvertLaunchFuncToCudaCalls.cpp#L370 containing just the binary data, so we can pass a pointer to that data to the runtime https://github.com/tensorflow/mlir/pull/118/files#diff-13d6d98d70fab37eda97625d19b84998R685. To create a VkShaderModule we just need to populate VkShaderModuleCreateInfo with the size of the binary shader and a pointer to it.
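For reference, that last step is small; a minimal sketch, assuming the llvm.global gives us a pointer to the SPIR-V words and their size in bytes (no error handling):
#include <vulkan/vulkan.h>
#include <cstdint>

// Sketch: wrap the serialized spv.module (exposed through an llvm.global) in a
// VkShaderModule. `binary` must point to the SPIR-V words and
// `binarySizeInBytes` must be a multiple of 4.
VkShaderModule createShaderModule(VkDevice device, const uint32_t *binary,
                                  size_t binarySizeInBytes) {
  VkShaderModuleCreateInfo createInfo = {};
  createInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
  createInfo.codeSize = binarySizeInBytes; // size in bytes, not words
  createInfo.pCode = binary;               // pointer to the SPIR-V binary
  VkShaderModule shaderModule = VK_NULL_HANDLE;
  vkCreateShaderModule(device, &createInfo, /*pAllocator=*/nullptr,
                       &shaderModule);
  return shaderModule;
}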
So as a result, after the second pass we get a module which can be run under the JitRunner; the pipeline should look similar to this code: https://github.com/tensorflow/mlir/blob/master/tools/mlir-cuda-runner/mlir-cuda-runner.cpp#L112.
Thanks!
Thanks @denis0x0D. Overall what you are suggesting makes sense to me, but I am not fluent enough in Vulkan speak to provide more feedback on the specifics. But the interaction with the kernel-to-SPIR-V-binary compilation makes sense.
@joker-eph: Thanks for catching my mistake. I edited my post to reflect the actual layout. The main point I was raising, though, was that you cannot serialize the entire module to a spv.module.
The conversion of a FuncOp with the gpu.kernel attribute to spv.module predates the existence of kernel_module. It is pretty straightforward (and would actually clean up the conversion a little bit) to lower a kernel_module to a spv.module. It is on my list of things to do. Will probably get to that in a day or so.
@denis0x0D: I will try to send the change to lower kernel_module to spv.module as a pull request to the MLIR GitHub. So that might be something to look out for w.r.t. the vulkan runner. Thanks!
@MaheshRavishankar
I will try to send the change to lower kernel_module to spv.module as a pull request to the mlir github. So that might be something to look out for w.r.t to the vulkan runner. Thanks!
Sounds great to me! By the way, those conversions look great: https://github.com/tensorflow/mlir/tree/master/test/Conversion/GPUToSPIRV
Thanks @MaheshRavishankar for the good catch! I was actually implicitly thinking that way and didn't notice that it is not obvious to everyone. :)
I think we are already creating the spv.module inside the top-level module, so it is not embedded in gpu.kernel_module: https://github.com/tensorflow/mlir/blob/d342ff9507f952697f537309f73746ae78394bd3/lib/Conversion/GPUToSPIRV/ConvertGPUToSPIRVPass.cpp#L48-L57
But yes going directly from gpu.kernel_module instead of gpu.kernel is cleaner and we can have multiple entry points in the same spv.module! (We need to do more verification along the way too.)
@antiagainst : At top of tree it is now added just before the FuncOp:
https://github.com/tensorflow/mlir/blob/5614db039b28026b5fe38aa96a21c4e6835d352b/lib/Conversion/GPUToSPIRV/ConvertGPUToSPIRVPass.cpp#L57
But the correct way is to convert the entire module into a spv.Module.
@denis0x0D : The conversion from GPU to SPIR-V has now changed with 8c152c5771d590fca9c6b64b0bc6e9563175efb8. The CL introduces attributes that you use to specify the ABI for entry-point functions. So when you convert from the GPU to the SPIR-V dialect, you add these attributes to the function arguments and the function itself; they describe various ABI-related aspects (like storage class, descriptor_set, binding, etc. for spv.globalVariables), as well as the workgroup size for the entry-point function. A later pass (just before serialization) will materialize this ABI.
@MaheshRavishankar thanks for mentioning this!
Late to the game, and just wanted to comment that it would be really awesome if we had a lowering for the host side of the gpu dialect in the Vulkan/SPIR-V context as well. I think using a C++ library to implement a higher-level API should make the host-side code-generation part relatively straightforward. However, as far as I have been told, it still requires quite some code to write such a C++ library.
I would not spend too much energy on the design of the C++ library's API and would just build what makes sense for easy code generation. The host side of the dialect will likely evolve a fair bit, so these abstractions won't be forever, and the mlir-vulkan-runner is not meant to replace a real runtime, so simplicity is more important.
I would not spend too much energy on the design of the C++ library's API and just build what makes sense for easy code generation.
That's the approach that has been taken in the current pull request. You'll need thousands of lines of code to bootstrap a basic Vulkan compute shader. Right now that's all hidden in a single entry point runOnVulkan.
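Just to illustrate the simplicity point, the codegen-facing surface of such a library can stay as small as a single C-callable entry point. A purely hypothetical shape (names, types, and signature are illustrative only, not the actual API of this PR) could be:
#include <cstddef>
#include <cstdint>

// Hypothetical shape of a "do everything" entry point; names and types are
// illustrative, not the API of this pull request.
struct VulkanBufferBinding {
  uint32_t descriptorSet; // set the buffer is bound to
  uint32_t binding;       // binding within that set
  float *hostData;        // host data copied to the device and back
  size_t elementCount;    // number of float elements
};

// Runs one compute dispatch of the given SPIR-V blob and copies results back
// into the bound host buffers; returns 0 on success.
int runComputeShaderOnVulkan(const uint32_t *spirvBlob, size_t blobSizeInBytes,
                             const char *entryPointName,
                             const VulkanBufferBinding *bindings,
                             size_t numBindings, uint32_t groupCountX,
                             uint32_t groupCountY, uint32_t groupCountZ);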