onnx-mlir
Native CUDA support
Will onnx-mlir support using native CUDA kernels to write operator kernels?
Hello,
Let me first lay out our intentions, and then we can discuss how your use case may fit into this project.
In this project, our goals for the ONNX dialect are twofold:
- Adding ONNX operators to MLIR, reading in protobuf models, and providing high-level support such as shape inference.
- Providing high-level optimizations on ONNX operators that are likely beneficial in all settings (e.g., constant folding).
In addition, we provide a reference lowering that uses the KRNL IR, which we are experimenting with to find the right abstraction for lowering and optimizing the ONNX dialect. The principal target of these optimizations is to enable optimizations across "kernels", i.e., high-level operators. At present, our primary focus is generating highly optimized code for CPUs, but the project should not be limited to that.
Now, if you simply want to transform some ONNX operations and steer them toward cuDNN-type calls, you can do so by adding rules that rewrite ONNX operations into a CUDA dialect representing cuDNN-type calls. If, instead, you are interested in native CUDA code where you perform many optimizations yourself, then a good approach would probably be to use the KRNL dialect to transform the code into parallelism suitable for CUDA, then translate KRNL to affine/loop, and piggyback on the existing GPU dialects. We can work with you if there are specific KRNL labels/operations that would simplify your task.
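To make the first option concrete, here is a minimal sketch of such a rewrite rule, written against the core MLIR pattern API. It is not code from onnx-mlir: the `cudnn` dialect and its `Conv2DOp` are hypothetical ops you would define yourself, and class names, namespaces, and header paths depend on the onnx-mlir/MLIR version you build against.

```cpp
// Sketch only: assumes a user-defined "cudnn" dialect with a hypothetical
// cudnn::Conv2DOp modeling a cuDNN convolution call, plus the ONNXConvOp
// class generated for the ONNX dialect (header path may differ by version).
#include "mlir/IR/PatternMatch.h"
#include "src/Dialect/ONNX/ONNXOps.hpp"

namespace {
// Rewrites onnx.Conv into a single op that later lowers to a cuDNN call.
struct ConvToCudnnPattern : public mlir::OpRewritePattern<mlir::ONNXConvOp> {
  using OpRewritePattern::OpRewritePattern;

  mlir::LogicalResult
  matchAndRewrite(mlir::ONNXConvOp convOp,
                  mlir::PatternRewriter &rewriter) const override {
    // Keep the result types, forward the operands (input, weights, bias),
    // and copy the ONNX attributes (pads, strides, dilations) so the
    // cuDNN-call op has everything it needs. The build signature of
    // cudnn::Conv2DOp is whatever you define in your own dialect.
    rewriter.replaceOpWithNewOp<cudnn::Conv2DOp>(
        convOp, convOp->getResultTypes(), convOp->getOperands(),
        convOp->getAttrs());
    return mlir::success();
  }
};
} // namespace
```

Patterns like this are typically collected into a RewritePatternSet and driven by a conversion pass that marks onnx.Conv illegal and the cudnn dialect legal; the cudnn ops then lower to plain function calls into a library that wraps cuDNN.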
Thank you for your response. I'm new to MLIR. We want to reuse our CUDA code without rewriting it using the KRNL dialect. Is it possible for onnx-mlir to support both the KRNL dialect and hand-written CUDA code?
Yes, it is possible; you will need to write some rules on how to change ONNX operations (such as convolution) into operations that represent the equivalent cuDNN call.
hand-written CUDA code
How is it packaged? I assume in a library call?
@feiyuvl for hand-written CUDA code, it might be easier to have a transformation from KRNL to CUDA, since the number of operations in KRNL is much smaller than the number of ONNX operations.
If you're just going to call a cuDNN function, then you can do what @AlexandreEichenberger suggested above.
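On the packaging question: one common shape, sketched below under assumed names, is an extern "C" entry point in a shared library that the compiled model calls like any other function; the wrapper handles the device memory traffic and delegates to a hand-written kernel compiled separately with nvcc. Everything here (symbol name, argument layout, `launchMyKernel`) is hypothetical.

```cpp
// Sketch of packaging hand-written CUDA code as a library call. The compiled
// model would emit a plain call to my_cuda_op, and this shared library would
// be linked in when building the model library.
#include <cuda_runtime.h>
#include <cstdint>

// Defined in a separate .cu file; wraps the actual <<<grid, block>>> launch.
void launchMyKernel(const float *devIn, float *devOut, int64_t numElements);

extern "C" void my_cuda_op(const float *input, float *output,
                           int64_t numElements) {
  size_t bytes = static_cast<size_t>(numElements) * sizeof(float);
  float *devIn = nullptr, *devOut = nullptr;

  // Move the input to the device, run the hand-written kernel, copy back.
  cudaMalloc(reinterpret_cast<void **>(&devIn), bytes);
  cudaMalloc(reinterpret_cast<void **>(&devOut), bytes);
  cudaMemcpy(devIn, input, bytes, cudaMemcpyHostToDevice);

  launchMyKernel(devIn, devOut, numElements);

  cudaMemcpy(output, devOut, bytes, cudaMemcpyDeviceToHost);
  cudaFree(devIn);
  cudaFree(devOut);
}
```

In practice you would keep buffers resident on the device across calls rather than copying per op, but that is an optimization on top of the same lib-call structure.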
Any updates on this work?
@doru1004 I want to do the same kind of work, lowering from KRNL to AMD GPUs. Are there any examples or demos for this? I don't know how to start.
@AlexandreEichenberger what is the current status for CUDA support? Is it possible to perform inference on GPU using the PyRuntime API? Thanks!
IBM folks are not currently working on any GPU accelerators in ONNX-MLIR, and given our focus, I don't anticipate any internal needs in the near/medium term. As an open-source project, we are committed to incorporating work from teams adding support directly into our repo, provided they commit resources to developing their approach.
To provide GPU support, there are several ways, some of which are very recent and might be quite promising.
- One is to use library calls for the ops, the approach we took to support our IBM Telum Integrated AI Accelerator (known as NNPA, for its instruction set, in this project). The accelerator support is modular, and one can add a new accelerator using the same template as used for NNPA. This approach is fairly well understood and easy to deploy, since we already have a template.
- We are currently working with two different groups that are adding ONNX-to-TOSA (a dialect that MLIR handles natively) and ONNX-to-MHLO (the MLIR HLO representation of DNN graphs used by TensorFlow) conversions. Using one of these approaches, you may piggyback GPU support via TOSA or MHLO within their respective environments. Both conversions are works in progress, but the more folks are interested in an approach, the more coders are available to make it work well.
- Another approach is to generate code directly from either the ONNX or Krnl dialect, producing code patterns suitable for GPUs. We don't currently have a suggested approach for that; the current Krnl lowering is geared toward CPUs. But one could enable different lowering passes that are specialized for GPUs (a minimal pass skeleton is sketched after this list). Ideally, one would design a scheme that encapsulates what is common among GPUs from different manufacturers, to minimize work duplication.
- Or, if you can work from MLIR Affine, you may export the Affine-level representation and work from there. We have a few Krnl ops that we lower during the Affine-to-LLVM conversion, so there might be some minor complications with that approach, but probably nothing that cannot be overcome.
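As a rough illustration of the GPU-specialized lowering pass mentioned above, here is a minimal pass skeleton using the standard MLIR conversion machinery. It is not part of onnx-mlir; the pass name, the `populateKrnlToGpuPatterns` helper, and the choice of which ops to mark illegal are all assumptions, and header paths and pass boilerplate vary across MLIR versions.

```cpp
// Skeleton only: mark the GPU-side dialects legal, the Krnl (or ONNX) ops
// illegal, and fill the pattern set with GPU-oriented rewrites.
#include "mlir/Dialect/GPU/IR/GPUDialect.h" // path differs in older MLIR
#include "mlir/Pass/Pass.h"
#include "mlir/Transforms/DialectConversion.h"

namespace {
struct KrnlToGpuPass
    : public mlir::PassWrapper<KrnlToGpuPass,
                               mlir::OperationPass<mlir::ModuleOp>> {
  void runOnOperation() override {
    mlir::MLIRContext *ctx = &getContext();
    mlir::ConversionTarget target(*ctx);

    // The GPU dialect is the legal end state of this conversion; the Krnl
    // loop ops would be marked illegal here (Krnl headers omitted) so the
    // conversion fails loudly if a pattern is missing.
    target.addLegalDialect<mlir::gpu::GPUDialect>();

    mlir::RewritePatternSet patterns(ctx);
    // populateKrnlToGpuPatterns(patterns); // hypothetical: your GPU rewrites.

    if (mlir::failed(mlir::applyPartialConversion(getOperation(), target,
                                                  std::move(patterns))))
      signalPassFailure();
  }
};
} // namespace
```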
My suggestion is that you fork our project, investigate one or more of these paths for a few ops, and then make an RFC so that we can jointly evaluate the proposed approach and take it from there.
Is it possible to perform inference on GPU using the PyRuntime API?
The PyRuntime API is simply there to invoke a model that was already compiled. PyRuntime is very lean: all it does is create tensors to stuff the input data into, and process tensor lists. It is not involved in 99% of the ONNX operations. We have a few (maybe 3 or so) ONNX/KRNL operations that require runtime support, mainly to create a hash and/or print. I assume that for a GPU, one may have to add support for these latter cases if there are operations that need hashes/printing on the GPU.
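For a sense of how lean that layer is: PyRuntime is essentially a thin wrapper over the same entry points as the C runtime API (OnnxMlirRuntime.h), where the compiled model exposes a `run_main_graph` function taking and returning tensor lists. The snippet below follows that API as shown in the onnx-mlir docs; the shapes and data are made up, and exact signatures should be checked against the header of the version you use.

```cpp
// Minimal driver for a compiled model, following the onnx-mlir C runtime API.
#include <OnnxMlirRuntime.h>
#include <cstdio>

// Entry point generated by onnx-mlir when compiling the model to a library.
extern "C" OMTensorList *run_main_graph(OMTensorList *);

int main() {
  // Build one float input tensor of shape 3x2 (example values only).
  float data[6] = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f};
  int64_t shape[2] = {3, 2};
  OMTensor *x = omTensorCreate(data, shape, /*rank=*/2, ONNX_TYPE_FLOAT);

  // Wrap the inputs in a list, run the model, and read the first output.
  OMTensor *inputs[1] = {x};
  OMTensorList *inputList = omTensorListCreate(inputs, 1);
  OMTensorList *outputList = run_main_graph(inputList);
  OMTensor *y = omTensorListGetOmtByIndex(outputList, 0);
  float *yData = static_cast<float *>(omTensorGetDataPtr(y));
  printf("first output element: %f\n", yData[0]);
  return 0;
}
```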