stable-diffusion-webui
Stable Diffusion Meta AITemplate with >= 200% performance increase
AITemplate from Meta promises a 200% or greater speedup in image generation.
Presently it is only available for the diffusers library: https://github.com/facebookincubator/AITemplate/tree/main/examples/05_stable_diffusion
PT = PyTorch, AIT = AITemplate implementation
Looking into what is needed for this to work:
We need to isolate all portions of the model used for sampling from ldm/taming and create torch-like AIT versions of them to be transpiled into C++.
https://facebookincubator.github.io/AITemplate/tutorial/how_to_infer_pt.html
A good example here is of the port of the attention module: https://github.com/facebookincubator/AITemplate/blob/main/examples/05_stable_diffusion/modeling/attention.py
Then we run the compile.py script to build the library, and inference proceeds as normal.
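For reference, here is a minimal sketch of that flow, loosely based on the AITemplate examples; the toy module, tensor names, and output directory are illustrative and not part of an actual SD port:

```python
# Rough sketch of the AITemplate workflow: define a torch-like module with
# AIT's frontend, mark inputs/outputs, then compile it into a C++/CUDA library.
from aitemplate.compiler import compile_model
from aitemplate.frontend import Tensor, nn
from aitemplate.testing import detect_target


class SimpleMLP(nn.Module):
    # A stand-in for a real submodule (attention, UNet block, etc.)
    # re-expressed with AITemplate's frontend ops.
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim * 4, specialization="fast_gelu")
        self.fc2 = nn.Linear(dim * 4, dim)

    def forward(self, x):
        return self.fc2(self.fc1(x))


batch, dim = 1, 768
model = SimpleMLP(dim)
model.name_parameter_tensor()

x = Tensor([batch, dim], name="input0", is_input=True)
y = model(x)
y._attrs["is_output"] = True
y._attrs["name"] = "output0"

# compile_model builds the shared library; at inference time the compiled
# module is loaded and fed the PyTorch weights.
target = detect_target()
module = compile_model(y, target, "./tmp", "simple_mlp")
```

That is roughly what compile.py does in the 05_stable_diffusion example, with demo.py then mapping the PyTorch weights into the compiled modules and running inference.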
I wonder if there will be the same challenges implementing this as there were with this other performance enhancement: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/576
In the case of that issue, if I understand correctly, the improved method could only run under Linux, there wasn't a clear way to cross-compile for Windows, and so collaborators were stuck waiting for upstream changes. Any idea if this is going to be compatible with Windows?
I think another potential challenge is AIT's hardware requirements (Ampere, etc.). Certainly if this could be implemented it could be put behind a cmd opt, but would that mean there may need to be multiple versions of the core items you mentioned?
You would need to duplicate the code in AIT syntax and add a flag to turn it on for the different hardware requirements, yes. It's very frustrating that there is no way to easily use the existing torch code.
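As a purely hypothetical sketch of what that flag-gating could look like (none of these option names or functions exist in the webui today):

```python
# Hypothetical sketch: the AIT code path sits behind a command-line option
# while the existing torch path stays the default for unsupported hardware.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--use-aitemplate", action="store_true",
                    help="load precompiled AITemplate modules instead of the torch ones")
opts, _ = parser.parse_known_args()


def load_torch_unet():
    # placeholder for the existing ldm/taming torch implementation
    print("loading the stock torch UNet")


def load_ait_unet():
    # placeholder for a duplicated, AIT-syntax UNet compiled ahead of time
    # (would only work on supported hardware, e.g. Ampere and newer)
    print("loading the compiled AITemplate UNet from ./tmp/unet")


load_unet = load_ait_unet if opts.use_aitemplate else load_torch_unet
load_unet()
```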
I don't think this repo currently uses diffusers, but I stumbled upon this PR:
- https://github.com/huggingface/diffusers/pull/532
Which has some comments talking about how it could potentially also make use of AITemplate in a future PR:
- https://github.com/huggingface/diffusers/pull/532#issuecomment-1297645301
  > That being said, integrating support for AITemplate wouldn't be too hard I believe, maybe for a next PR if you think this could be valuable :)
- https://github.com/huggingface/diffusers/pull/532#issuecomment-1297658919
  > Regarding AITemplate, I think those numbers were before they integrated xformers
- https://github.com/huggingface/diffusers/pull/532#issuecomment-1297723880
  > It's making AITemplate even faster and more memory efficient
AITemplate + xformers combination just dropped:
Done: https://github.com/facebookincubator/AITemplate/pull/74
Originally posted by @antinucleon in https://github.com/facebookincubator/AITemplate/issues/13#issuecomment-1309591220
- https://github.com/facebookincubator/AITemplate/pull/74
Sync to v0.1.1 version
Impact on current examples:
- Stable Diffusion: A100-40GB / CUDA 11.6, 50 steps (ms)
Batch 1
| Module | AIT v0.1 | AIT v0.1.1 | v0.1.1 Speedup |
| --- | --- | --- | --- |
| CLIP | 0.87 | 0.87 | 1X |
| UNet | 22.47 | 18.11 | 1.24X |
| VAE | 37.43 | 20.14 | 1.85X |
| Sum of Three | 1161.8 | 926.51 | 1.25X |
| Pipeline | 1282.98 | 1013 | 1.26X |

("Sum of Three" is CLIP + 50 × UNet + VAE, i.e. the UNet time is counted once per step over the 50 steps.)

v0.1: 42.45 it/s
v0.1.1: 53.30 it/s
Batch 16
| Module | v0.1 | v0.1.1 | Speedup |
| --- | --- | --- | --- |
| Pipeline | 14931.95 | 11064.81 | 1.34X |
- BERT CUDA long sequence performance will be significantly boosted by using new mem_eff_attention codegen
- VIT CUDA large resolution performance will be significantly boosted by using new mem_eff_attention codegen
A friend of mine was able to get his RTX 4090 inference speed from 25-28 it/s to the 61-64 it/s range.
He said it was a rather painstaking process to get AITemplate to work, with a lot of errors along the way that he had to solve ... But the results speak for themselves!
Would be very interesting if someone investigated this further and figured out a way to port it to the webui
Here are his specs: RTX 4090 FE (stock settings), WSL, CUDA 11.6, latest AITemplate (13-11-2022), Intel 12700KF, Windows 11 22H2
Here's a screenshot he provided me:
Even 25-28 it/s is insane. I'm lucky to get 6-8 on my 4090.
I am getting a stable 11 it/s on my 3080..?
> A friend of mine was able to get his RTX 4090 inference speed from 25-28 it/s to the 61-64 it/s range.
> He said it was a rather painstaking process to get AITemplate to work, with a lot of errors along the way that he had to solve.
@YourFriendlyNeighborhoodMONKE Curious if your friend took notes along the way about how they managed to get things going, what issues they ran into, how they solved them, etc. Would be awesome knowledge to have shared here to make others' lives easier, and/or bootstrap possibly getting it running for this repo!
> Even 25-28 it/s is insane. I'm lucky to get 6-8 on my 4090.
I got about 7.5-8.5 out of the box, which is actually really bad for a 4090, but yeah, it's because afaik there's still no support for Lovelace in PyTorch and other areas as well - 3090s are probably beating those numbers out of the box.
The easiest optimization you could do would be the cuDNN one, which is just replacing some .dll files in /venv/lib/site-packages/torch/lib/ - you can find the files by searching "4090 cudnn" in a discussion thread here and also on r/StableDiffusion.
xformers is fairly simple and straightforward too, as auto's webui already supports it out of the box without compiling; all you really need to do to get it installed is put --xformers into webui.bat after %PYTHON% launch.py %*.
I got a little under 20 it/s after those two, which isn't as high as some are able to get, but I'm happy enough and will just wait for better 4090 support and for things like AIT becoming available for easy Windows installation or included in the webui.
Remember to back up everything before attempting!
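If it helps, here is a small sanity-check script (assuming you run it with the webui venv's Python) to confirm which torch/cuDNN build is actually being loaded after swapping the DLLs, and whether xformers is importable:

```python
# Small sanity check after swapping the cuDNN DLLs or enabling xformers.
# Run it with the webui's venv Python so it sees the same packages.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
print("cuDNN version:", torch.backends.cudnn.version())

# xformers is optional; an ImportError just means it isn't installed.
try:
    import xformers
    print("xformers:", xformers.__version__)
except ImportError:
    print("xformers: not installed")
```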
> A friend of mine was able to get his RTX 4090 inference speed from 25-28 it/s to the 61-64 it/s range. He said it was a rather painstaking process to get AITemplate to work, with a lot of errors along the way that he had to solve.
>
> @YourFriendlyNeighborhoodMONKE Curious if your friend took notes along the way about how they managed to get things going, what issues they ran into, how they solved them, etc. Would be awesome knowledge to have shared here to make others' lives easier, and/or bootstrap possibly getting it running for this repo!
I understand, but I doubt he has anything, because the way I understood it, it took him a couple of days of struggle, and he seems to be pretty advanced in these areas as well - these kinds of things at this stage tend to have quite varied errors to deal with, which are also hardware/software-configuration specific.
I'll ask anyway!
Btw, I saw hlky's comment stating that "at Stable Horde there are about 40 workers that can test AIT on various GPUs" ... So at least there's some interest out there to gather testing data!
Another potential performance gain issue:
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4721
A few semi-related issues about exploring using AITemplate with Dreambooth:
- https://github.com/facebookincubator/AITemplate/issues/102
- https://github.com/TheLastBen/fast-stable-diffusion/issues/73
- https://github.com/ShivamShrirao/diffusers/issues/32
It is just for inference, so it won't be helpful for training. I also tested it; it's good for inference but also takes a really long time to compile.
Just FYI - the compilation time with the latest open-source version has been improved a lot from our first release. In our experience, it can be 4X faster for models where the computation-intensive ops are mostly GEMM-family ops. We've made similar improvements for Conv ops in our internal version, which will be synced to the open-source repo later. Stay tuned. Thanks.
Originally posted by @chenyang78 in https://github.com/facebookincubator/AITemplate/issues/102#issuecomment-1326014540
Why improve only the NVIDIA cards? Do you want to kill AMD?
Every prompt change needs a rebuild, and it costs above 2 minutes.
Hi guys! I am a newbie with stable-diffusion-webui. I don't know whether AITemplate is available in stable-diffusion-webui. Are there any plans to support it?