stable-diffusion-webui
Stable Diffusion Meta AITemplate with >= 200% performance increase
AITemplate from Meta promises a 200% or greater speedup in image generation.
Presently it is only available for the diffusers library: https://github.com/facebookincubator/AITemplate/tree/main/examples/05_stable_diffusion
PT = PyTorch, AIT = AITemplate implementation
Looking into what is needed for this to work:
We need to isolate all portions of the model used for sampling from ldm/taming and create torch-like AIT versions of them to be transpiled into C++.
https://facebookincubator.github.io/AITemplate/tutorial/how_to_infer_pt.html
A good example here is of the port of the attention module: https://github.com/facebookincubator/AITemplate/blob/main/examples/05_stable_diffusion/modeling/attention.py
Then we run the compile.py script to build the library, and inference proceeds as normal.
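For reference, here is a minimal sketch of that flow, loosely based on the AITemplate examples; the toy module, tensor names, and output directory are illustrative and not part of an actual SD port:

```python
# Rough sketch of the AITemplate workflow: define a torch-like module with
# AIT's frontend, mark inputs/outputs, then compile it into a C++/CUDA library.
from aitemplate.compiler import compile_model
from aitemplate.frontend import Tensor, nn
from aitemplate.testing import detect_target


class SimpleMLP(nn.Module):
    # A stand-in for a real submodule (attention, UNet block, etc.)
    # re-expressed with AITemplate's frontend ops.
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim * 4, specialization="fast_gelu")
        self.fc2 = nn.Linear(dim * 4, dim)

    def forward(self, x):
        return self.fc2(self.fc1(x))


batch, dim = 1, 768
model = SimpleMLP(dim)
model.name_parameter_tensor()

x = Tensor([batch, dim], name="input0", is_input=True)
y = model(x)
y._attrs["is_output"] = True
y._attrs["name"] = "output0"

# compile_model builds the shared library; at inference time the compiled
# module is loaded and fed the PyTorch weights.
target = detect_target()
module = compile_model(y, target, "./tmp", "simple_mlp")
```

That is roughly what compile.py does in the 05_stable_diffusion example, with demo.py then mapping the PyTorch weights into the compiled modules and running inference.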
I wonder if there will be the same challenges implementing this as there were with this other performance enhancement: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/576
In the case of that issue, if I understand correctly, the improved method could only run under Linux, there wasn't a clear way to cross-compile for Windows, and so collaborators were stuck waiting for upstream changes. Any idea if this is going to be compatible with Windows?
I think another potential challenge is AIT's hardware requirements (Ampere, etc.). Certainly if this could be implemented it could be put behind a cmd opt, but would that mean there may need to be multiple versions of the core items you mentioned?
You would need to duplicate the code in AIT syntax and add a flag to turn it on for the different hardware requirements, yes. It's very frustrating that there is no way to easily use the existing torch code.
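As a purely hypothetical sketch of what that flag-gating could look like (none of these option names or functions exist in the webui today):

```python
# Hypothetical sketch: the AIT code path sits behind a command-line option
# while the existing torch path stays the default for unsupported hardware.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--use-aitemplate", action="store_true",
                    help="load precompiled AITemplate modules instead of the torch ones")
opts, _ = parser.parse_known_args()


def load_torch_unet():
    # placeholder for the existing ldm/taming torch implementation
    print("loading the stock torch UNet")


def load_ait_unet():
    # placeholder for a duplicated, AIT-syntax UNet compiled ahead of time
    # (would only work on supported hardware, e.g. Ampere and newer)
    print("loading the compiled AITemplate UNet from ./tmp/unet")


load_unet = load_ait_unet if opts.use_aitemplate else load_torch_unet
load_unet()
```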
I don't think this repo currently uses diffusers, but I stumbled upon this PR:
- https://github.com/huggingface/diffusers/pull/532
Which has some comments talking about how it could potentially also make use of AITemplate in a future PR:
- https://github.com/huggingface/diffusers/pull/532#issuecomment-1297645301
  > That being said, integrating support for AITemplate wouldn't be too hard I believe, maybe for a next PR if you think this could be valuable :)
- https://github.com/huggingface/diffusers/pull/532#issuecomment-1297658919
  > Regarding AITemplate, I think those numbers were before they integrated xformers
- https://github.com/huggingface/diffusers/pull/532#issuecomment-1297723880
  > It's making AITemplate even faster and more memory efficient
AITemplate + xformers combination just dropped:
Done: https://github.com/facebookincubator/AITemplate/pull/74
Originally posted by @antinucleon in https://github.com/facebookincubator/AITemplate/issues/13#issuecomment-1309591220
- https://github.com/facebookincubator/AITemplate/pull/74
Sync to v0.1.1 version
Impact on current examples:
- Stable Diffusion: A100-40GB / CUDA 11.6, 50 steps (ms)
Batch 1
| Module | AIT v0.1 | AIT v0.1.1 | v0.1.1 Speedup |
| --- | --- | --- | --- |
| CLIP | 0.87 | 0.87 | 1X |
| UNet | 22.47 | 18.11 | 1.24X |
| VAE | 37.43 | 20.14 | 1.85X |
| Sum of Three | 1161.8 | 926.51 | 1.25X |
| Pipeline | 1282.98 | 1013 | 1.26X |

("Sum of Three" is CLIP + 50 × UNet + VAE, i.e. the UNet time is counted once per step over the 50 steps.)

v0.1: 42.45 it/s
v0.1.1: 53.30 it/s
Batch 16
| Module | v0.1 | v0.1.1 | Speedup |
| --- | --- | --- | --- |
| Pipeline | 14931.95 | 11064.81 | 1.34X |
- BERT CUDA long sequence performance will be significantly boosted by using new mem_eff_attention codegen
- VIT CUDA large resolution performance will be significantly boosted by using new mem_eff_attention codegen
A friend of mine was able to get his RTX 4090 inference speed from 25-28 it/s to the 61-64 it/s range.
He said it was a rather painstaking process to get AITemplate to work, with a lot of errors along the way that he had to solve ... But the results speak for themselves!
Would be very interesting if someone investigated this further and figured out a way to port it to the webui
Here are his specs: RTX 4090 FE (stock settings), WSL, CUDA 11.6, latest AITemplate (13-11-2022), Intel 12700KF, Windows 11 22H2
Here's a screenshot he provided me:
Even 25-28 it/s is insane. I'm lucky to get 6-8 on my 4090.
I am getting a stable 11 it/s on my 3080..?
> A friend of mine was able to get his RTX 4090 inference speed from 25-28 it/s to the 61-64 it/s range.
> He said it was a rather painstaking process to get AITemplate to work, with a lot of errors along the way that he had to solve.
@YourFriendlyNeighborhoodMONKE Curious if your friend took notes along the way about how they managed to get things going, what issues they ran into, how they solved them, etc. Would be awesome knowledge to have shared here to make others' lives easier, and/or bootstrap possibly getting it running for this repo!
> Even 25-28 it/s is insane. I'm lucky to get 6-8 on my 4090.
I got about 7.5-8.5 out of the box, which is actually really bad for a 4090, but yeah, it's because afaik there's still no support for Lovelace in PyTorch and other areas as well - 3090s are probably beating those numbers out of the box.
The easiest optimization you could do would be the cuDNN one, which is just replacing some .dll files in /venv/lib/site-packages/torch/lib/ - you can find the files by searching "4090 cudnn" in a discussion thread here and also on r/StableDiffusion.
xformers is fairly simple and straightforward too, as auto's webui already supports it out of the box without compiling; all you really need to do to get it installed is put --xformers into webui.bat after %PYTHON% launch.py %*.
I got a little under 20 it/s after those two, which isn't as high as some are able to get, but I'm happy enough and will just wait for better 4090 support and for things like AIT becoming available for easy Windows installation or included in the webui.
Remember to back up everything before attempting!
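If it helps, here is a small sanity-check script (assuming you run it with the webui venv's Python) to confirm which torch/cuDNN build is actually being loaded after swapping the DLLs, and whether xformers is importable:

```python
# Small sanity check after swapping the cuDNN DLLs or enabling xformers.
# Run it with the webui's venv Python so it sees the same packages.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
print("cuDNN version:", torch.backends.cudnn.version())

# xformers is optional; an ImportError just means it isn't installed.
try:
    import xformers
    print("xformers:", xformers.__version__)
except ImportError:
    print("xformers: not installed")
```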
> A friend of mine was able to get his RTX 4090 inference speed from 25-28 it/s to the 61-64 it/s range. He said it was a rather painstaking process to get AITemplate to work, with a lot of errors along the way that he had to solve.
>
> @YourFriendlyNeighborhoodMONKE Curious if your friend took notes along the way about how they managed to get things going, what issues they ran into, how they solved them, etc. Would be awesome knowledge to have shared here to make others' lives easier, and/or bootstrap possibly getting it running for this repo!
I understand, but I doubt he has anything, because the way I understood it, it took him a couple of days of struggle, and he seems to be pretty advanced in these areas as well - these kinds of things at this stage tend to have quite varied errors to deal with, which are also hardware/software-configuration specific.
I'll ask anyway!
Btw, I saw hlky's comment stating that "at Stable Horde there are about 40 workers that can test AIT on various GPUs" ... So at least there's some interest out there to gather testing data!
Another potential performance gain issue:
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4721
A few semi-related issues about exploring using AITemplate with Dreambooth:
- https://github.com/facebookincubator/AITemplate/issues/102
- https://github.com/TheLastBen/fast-stable-diffusion/issues/73
- https://github.com/ShivamShrirao/diffusers/issues/32
It is just for inference, so it won't be helpful for training. I also tested it; it's good for inference but also takes a really long time to compile.
Just FYI - the compilation time with the latest open-source version has been improved a lot from our first release. In our experience, it can be 4X faster for models where the computation-intensive ops are mostly GEMM-family ops. We've made similar improvements for Conv ops in our internal version, which will be synced to the open-source repo later. Stay tuned. Thanks.
Originally posted by @chenyang78 in https://github.com/facebookincubator/AITemplate/issues/102#issuecomment-1326014540
Why improve only the NVIDIA cards? Do you want to kill AMD?
Every prompt change needs a rebuild, and it costs above 2 minutes.
Hi guys! I am a newbie with stable-diffusion-webui. I don't know whether AITemplate is available in stable-diffusion-webui. Are there any plans to support it?