wonnx
Support Stable Diffusion model
Is your feature request related to a problem? Please describe.
I would like to be able to run Stable Diffusion using wonnx.
Describe the solution you'd like
At least these operators are missing and should be implemented before even trying to run Stable Diffusion on wonnx: Einsum, Erf, Expand, InstanceNormalization, Shape, Slice
This is the minimum based on this guide that simplifies the onnx model (see the simplification table): https://www.photoroom.com/tech/stable-diffusion-25-percent-faster-and-save-seconds/
Probably many more things will be needed, but I'm creating this issue because it can be a really interesting use case to be able to run SD in rust on the GPU directly.
I don't have much experience with wonnx or even ML, but I decided to create this issue because it surprised me how few operators are missing to run this model. I would need to get more experience with stable diffusion, diffusers library and onnx in python before attempting to port it here, but maybe there are more experienced users interested too.
Hello Sirius, thanks for taking interest in wonnx!
The erf function is not yet a built-in function in WGSL, see: https://www.w3.org/TR/WGSL/
Running Stable Diffusion on wonnx will therefore require an approximation of the erf function. At this point I am not sure how to implement this.
Thanks for your answer. Again, let me reiterate my ignorance on this field, but this is what I've found.
The implementation used in tract seems very simple https://github.com/sonos/tract/blob/21928fb3652d028db5be1348e6017494318d4b86/onnx-opl/src/erf.rs
Looking at other WGSL shaders for other operations, it seems translatable.
The signum in WGSL is just sign, abs is the same, powi we can just use pow or even unroll it as it's 16 (and it's short and efficient), recip is just 1/x.
copysign is trickier, but for the erf function it should be just a multiplication by the sign of the original input (as erf(0) == 0).
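To make the translation concrete, here is a minimal CPU-side sketch in Rust that uses only operations with a direct WGSL counterpart (abs, sign, multiply, divide). It assumes the approximation is Abramowitz & Stegun formula 7.1.28 (a degree-6 polynomial raised to the 16th power, then a reciprocal), which fits the operations listed above; I have not checked the coefficients against the tract source, so treat them and the name `erf_approx` as assumptions.

```rust
/// Coefficients a1..a6 of Abramowitz & Stegun formula 7.1.28 (assumed, see above).
const A: [f32; 6] = [
    0.0705230784,
    0.0422820123,
    0.0092705272,
    0.0001520143,
    0.0002765672,
    0.0000430638,
];

/// Approximates erf(x) as sign(x) * (1 - 1 / (1 + a1*t + ... + a6*t^6)^16), t = |x|.
fn erf_approx(x: f32) -> f32 {
    let t = x.abs();

    // Horner evaluation of a1*t + a2*t^2 + ... + a6*t^6.
    let mut p = 0.0f32;
    for &a in A.iter().rev() {
        p = (p + a) * t;
    }
    let p = 1.0 + p;

    // p^16 unrolled as four squarings, instead of calling pow.
    let p2 = p * p;
    let p4 = p2 * p2;
    let p8 = p4 * p4;
    let p16 = p8 * p8;

    // The reciprocal is written as 1/x, as it would be in a shader.
    let y = 1.0 - 1.0 / p16;

    // copysign replaced by a multiplication with the input's sign.
    // This is safe because y is exactly 0 when x is 0.
    x.signum() * y
}
```

Calling `erf_approx(1.0)` should give roughly 0.8427. For large inputs the 16th power overflows to infinity in f32, so the result saturates at ±1, which is the desired limit.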
I've looked a little at the other missing ops, and they don't seem as straightforward.
I looked into this a few weeks ago - it is a significant chunk of work for 2 reasons:
- The ops to implement are complicated (e.g. Einsum)
- WONNX does not currently support parameterized dimensions, which would be required to implement the text encoder.
Thanks for looking at it. I hope one day we will be able to run something like SD in pure Rust.
As a matter of interest, tch-rs recently implemented Stable Diffusion: https://github.com/LaurentMazare/diffusers-rs
It's not directly applicable to this, but it could inform future development efforts.
- WONNX does not currently support parameterized dimensions, which would be required to implement the text encoder.
I am not too familiar with SD but at least for BERT and other text encoders, parameterized dimensions can be replaced with fixed dimensions just fine (the model will then work with text token strings up to the statically set length).
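As a rough illustration of the fixed-length approach, the sketch below pads or truncates a token sequence to a static length and builds the matching attention mask. The function name `pad_to_fixed_len`, the `i64` token type, and the padding id are hypothetical and not part of wonnx; a real model defines its own padding token and maximum length.

```rust
/// Pads (or truncates) `tokens` to exactly `max_len` entries.
/// Returns the padded ids and an attention mask (1 = real token, 0 = padding).
/// Hypothetical helper for illustration only.
fn pad_to_fixed_len(tokens: &[i64], max_len: usize, pad_id: i64) -> (Vec<i64>, Vec<i64>) {
    // Number of real tokens that fit into the static length.
    let used = tokens.len().min(max_len);

    // Copy the tokens that fit, then fill the rest with the padding id.
    let mut ids = tokens[..used].to_vec();
    ids.resize(max_len, pad_id);

    // Mark real tokens with 1 and padding positions with 0.
    let mut mask = vec![1i64; used];
    mask.resize(max_len, 0);

    (ids, mask)
}
```

With this, every input has the same shape, so the model can be exported with static dimensions; the cost is wasted computation on padding and a hard limit on input length.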
- WONNX does not currently support parameterized dimensions, which would be required to implement the text encoder.
The shape inference engine in WONNX now supports this (it allows you to set parametrized dimensions, then infer shapes for other outputs).
I looked into this a few weeks ago - it is a significant chunk of work for 2 reasons:
- The ops to implement are complicated (i.e Einsum)
- WONNX does not currently support parameterized dimensions, which would be required to implement the text encoder.
As for Einsum: this may be feasible, a first start is in #154
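For readers unfamiliar with Einsum, here is a naive CPU reference in Rust for one fixed equation, `bik,bjk->bij` (a batched product of one matrix with the transpose of another). The equation is chosen only as an example and is not taken from the Stable Diffusion graph; the function name and the flat row-major layout are likewise assumptions. A general operator has to parse arbitrary equations and generate the loops (or a shader) accordingly, which is where the complexity comes from.

```rust
/// Naive evaluation of the einsum equation "bik,bjk->bij".
/// `a` has shape [batch, i_n, k_n] and `b` has shape [batch, j_n, k_n],
/// both stored as flat row-major buffers. Illustrative only.
fn einsum_bik_bjk_bij(
    a: &[f32],
    b: &[f32],
    batch: usize,
    i_n: usize,
    j_n: usize,
    k_n: usize,
) -> Vec<f32> {
    assert_eq!(a.len(), batch * i_n * k_n);
    assert_eq!(b.len(), batch * j_n * k_n);

    let mut out = vec![0.0f32; batch * i_n * j_n];
    for bt in 0..batch {
        for i in 0..i_n {
            for j in 0..j_n {
                // Sum over k, the index that does not appear in the output.
                let mut acc = 0.0f32;
                for k in 0..k_n {
                    acc += a[(bt * i_n + i) * k_n + k] * b[(bt * j_n + j) * k_n + k];
                }
                out[(bt * i_n + i) * j_n + j] = acc;
            }
        }
    }
    out
}
```

Indices that appear in the output (`b`, `i`, `j`) become outer loops, and indices that appear only in the inputs (`k`) are summed over.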