
Make it easier to inject custom op

Open zimond opened this issue 2 years ago • 11 comments

Currently I must modify the source code to add support for a custom op, which is quite inconvenient. I think the large match in compile.rs could be abstracted into a trait: by letting users implement the trait and register custom ops in an in-app registry, the framework would be much easier to extend.

zimond avatar Mar 07 '22 08:03 zimond

@zimond great suggestion! I know other runtimes have a way to register custom ops as well. Not sure if @haixuanTao has the time, I am working on some other things at the moment; feel free to send a PR if you feel like doing this yourself. Would also be happy to discuss designs here if you want.

pixelspark avatar Mar 07 '22 15:03 pixelspark

Another somewhat related question: I want to implement a custom op, dcnv2, which has some pre/post-processing steps around the CUDA function. I looked through the wonnx code and cannot find a way to easily manipulate input/output. Any suggestions, or maybe I missed something?

zimond avatar Mar 08 '22 14:03 zimond

> Another somewhat related question: I want to implement a custom op, dcnv2, which has some pre/post-processing steps around the CUDA function. I looked through the wonnx code and cannot find a way to easily manipulate input/output. Any suggestions, or maybe I missed something?

Currently wonnx translates each op into a shader and will run them sequentially. Intermediate data stays in buffers in GPU memory until the very end, where we 'download' the data from output buffers in GPU memory to main memory.

What kind of manipulation would you like to do?

It should be fairly easy to tell wonnx to run only up to a certain point and fetch the output, then feed that into a second network (and you can do the manipulation on CPU in between). We don't really have a facility to implement this at the op level right now.

pixelspark avatar Mar 08 '22 16:03 pixelspark

I need to run a matmul(input_2, output_0) to get the final output. I doubt I could change the optimizer to produce more than one node in optimized_with, so maybe complex custom ops could be decomposed into a combination of several ONNX-supported ops.

related code here

zimond avatar Mar 08 '22 16:03 zimond

Well, if the matmul only happens at the end of the operation, you can try to use a barrier (https://www.w3.org/TR/WGSL/#sync-builtin-functions) and paste the matmul shader right below it. You won't have to add another node, and in terms of performance you may even win a bit by saving buffer space.
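An illustrative WGSL fragment of that idea (buffer names and both op bodies are placeholders). One caveat worth keeping in mind: WGSL barriers such as `storageBarrier()` only synchronize invocations within a single workgroup, not across the whole dispatch:

```wgsl
// Hypothetical sketch; binding layout and op bodies are placeholders.
@group(0) @binding(0) var<storage, read> input_2: array<f32>;
@group(0) @binding(1) var<storage, read_write> output_0: array<f32>;
@group(0) @binding(2) var<storage, read_write> final_out: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    // ... dcnv2 body: writes output_0 ...

    // Attempt to make the output_0 writes visible before the matmul.
    // Note: this only synchronizes invocations within one workgroup.
    storageBarrier();

    // ... matmul(input_2, output_0) body: writes final_out ...
}
```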

But I agree that implementing those custom ops is for the moment cumbersome. I'm also on something else at the moment so feel free to implement it :)

haixuanTao avatar Mar 08 '22 17:03 haixuanTao

Another option would be to slightly change the compiler code such that it allows invoking different functions in a shader (with their own thread counts) in sequence. This should be a fairly easy change.

pixelspark avatar Mar 08 '22 19:03 pixelspark

It seems that you cannot use barriers to force synchronization of storage buffers. In my custom op, I need to fill an intermediate buffer and then use that buffer in a matmul, and there is no way to ensure that all invocations filling the buffer are done before the matmul runs. So @pixelspark's suggestion seems to be the only workaround: split the custom op into two shaders, and create two pipelines and two dispatches. If you think this makes sense, I could extract my code and submit a PR.

zimond avatar Mar 19 '22 07:03 zimond

@zimond sure, I will be happy to review PRs!

The question is how to implement this without overcomplicating things. My first thought was to change NodeTemplate to accept an (optional) list of entry point names (which would then be called in order with the same bindings and thread counts), but this might not suit your needs. Another approach would be to return a Vec<NodeTemplate> from the compiler and call these sequentially; each invocation could then have different shader code and thread counts, but would still have to use the same bindings. If you also need additional intermediate buffers, some more changes would be necessary...

pixelspark avatar Mar 19 '22 17:03 pixelspark

If you use a single shader file with multiple entry points, wgpu will complain about unused bindings and so on, which could be confusing, as wgpu requires all declared bindings to be used by an entry point. I think the better way would be to allow a node to emit a sequence of CompiledNode, each with its own bindings (via an index vector into the node's "global" bindings), shader, and thread counts.

For reference, tract has a mechanism named wired to wire an op to one or several other ops.

About the intermediate buffers: currently I'm hacking the optimizer to inject new inputs, which is quite ugly.

zimond avatar Mar 20 '22 03:03 zimond

Hm, good point. In that case, perhaps splitting the op into multiple 'internal' ops (each with its own shader and buffers) is better?

pixelspark avatar Mar 20 '22 16:03 pixelspark

Yes indeed

zimond avatar Mar 21 '22 02:03 zimond