Plans for a Triton implementation?
Thank you guys for open-sourcing this amazing work. I was curious whether there are any plans for a higher-level Triton implementation. I would like to experiment with this project in tandem with a library I have been working on to accelerate diffusion models, but I am not very familiar with CUDA yet.
Looking forward to your response 🙂
We may try if we have free time, but it's not in the pipeline at the moment. We welcome community contributions!
What changes are you interested in making that Triton would be helpful for?
Mainly looking for an implementation I can easily play around with, hopefully for things like bias and activation fusion, extension to 2D, etc. Is there a reference PyTorch implementation anywhere I can look at?