
Make it easier to inject custom op

Open zimond opened this issue 2 years ago • 11 comments

Currently I must modify the source code to add support for a custom op, which is quite inconvenient. I think the large match in compile.rs could be abstracted into a trait: by letting users implement the trait and register custom ops in an in-app registry, the framework would be much easier to extend.

zimond avatar Mar 07 '22 08:03 zimond

@zimond great suggestion! I know other runtimes have a way to register custom ops as well. Not sure if @haixuanTao has the time, I am working on some other things at the moment; feel free to send a PR if you feel like doing this yourself. Would also be happy to discuss designs here if you want.

pixelspark avatar Mar 07 '22 15:03 pixelspark

Another somewhat related question: I want to implement a custom op, dcnv2, which has some pre/post-processing steps around the CUDA function. I looked through the wonnx code and cannot find a way to easily manipulate input/output. Any suggestions, or maybe I missed something?

zimond avatar Mar 08 '22 14:03 zimond

> Another somewhat related question: I want to implement a custom op, dcnv2, which has some pre/post-processing steps around the CUDA function. I looked through the wonnx code and cannot find a way to easily manipulate input/output. Any suggestions, or maybe I missed something?

Currently wonnx translates each op into a shader and will run them sequentially. Intermediate data stays in buffers in GPU memory until the very end, where we 'download' the data from output buffers in GPU memory to main memory.

What kind of manipulation would you like to do?

It should be fairly easy to tell wonnx to run only up to a certain point and fetch the output, then feed that into a second network (and you can do the manipulation on CPU in between). We don't really have a facility to implement this at the op level right now.

pixelspark avatar Mar 08 '22 16:03 pixelspark

I need to run a matmul(input_2, output_0) to get the final output. I doubt I could change the optimizer to produce more than one node in optimized_with, so maybe complex custom ops could be decomposed into a combination of several ONNX-supported ops.

related code here

zimond avatar Mar 08 '22 16:03 zimond

Well, if the matmul only happens at the end of the operation, you can try to use a barrier (https://www.w3.org/TR/WGSL/#sync-builtin-functions) and paste the matmul shader right below it. You won't have to add another node, and in terms of performance you may even win a bit by saving buffer space.
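An illustrative WGSL fragment of that idea (buffer names and both op bodies are placeholders). One caveat worth keeping in mind: WGSL barriers such as `storageBarrier()` only synchronize invocations within a single workgroup, not across the whole dispatch:

```wgsl
// Hypothetical sketch; binding layout and op bodies are placeholders.
@group(0) @binding(0) var<storage, read> input_2: array<f32>;
@group(0) @binding(1) var<storage, read_write> output_0: array<f32>;
@group(0) @binding(2) var<storage, read_write> final_out: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    // ... dcnv2 body: writes output_0 ...

    // Attempt to make the output_0 writes visible before the matmul.
    // Note: this only synchronizes invocations within one workgroup.
    storageBarrier();

    // ... matmul(input_2, output_0) body: writes final_out ...
}
```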

But I agree that implementing those custom ops is for the moment cumbersome. I'm also on something else at the moment so feel free to implement it :)

haixuanTao avatar Mar 08 '22 17:03 haixuanTao

Another option would be to slightly change the compiler code such that it allows invoking different functions in a shader (with their own thread counts) in sequence. This should be a fairly easy change.

pixelspark avatar Mar 08 '22 19:03 pixelspark

It seems that you cannot use barriers to force synchronization of storage buffers. In my custom op, I need to fill an intermediate buffer and then use that buffer in a matmul, and there is no way to ensure that all invocations filling the buffer are done before the matmul runs. So @pixelspark's suggestion seems to be the only workaround: split the custom op into two shaders, and create two pipelines and two dispatches. If you think this makes sense, I could extract my code and submit a PR.

zimond avatar Mar 19 '22 07:03 zimond

@zimond sure, I will be happy to review PRs!

The question is how to implement this without overcomplicating things. My first thought was to change NodeTemplate to accept an (optional) list of entry point names (which would then be called in order with the same bindings and thread counts), but this might not suit your needs. Another approach would be to return a Vec<NodeTemplate> from the compiler and call these sequentially; each invocation could then have different shader code and thread counts, but would still have to use the same bindings. If you also need additional intermediate buffers, some more changes would be necessary...

pixelspark avatar Mar 19 '22 17:03 pixelspark

If you use a single shader file with multiple entry points, wgpu will complain about unused bindings and so on, which could be confusing, as wgpu requires all declared bindings to be used by an entry point. I think the better way would be to allow a node to emit a sequence of CompiledNode, each with its own bindings (via an index vector into the node's "global" bindings), shader, and thread counts.

For reference, tract has a mechanism named wired to wire an op to one or several other ops.

About the intermediate buffers: currently I'm hacking the optimizer to inject new inputs, which is quite ugly.

zimond avatar Mar 20 '22 03:03 zimond

Hm, good point. In that case, perhaps splitting the op into multiple 'internal' ops (each with its own shader and buffers) is better?

pixelspark avatar Mar 20 '22 16:03 pixelspark

Yes indeed

zimond avatar Mar 21 '22 02:03 zimond