
🚀 Precompiled xFormers for CUDA 12.4 and PyTorch 2.4 Compatibility

sashaok123 opened this issue 1 year ago • 12 comments

I have asked the xFormers developers for a precompiled version of xFormers that is compatible with CUDA 12.4 and PyTorch 2.4: https://github.com/facebookresearch/xformers/issues/1079

They have now built precompiled wheels for CUDA 12.4 and PyTorch 2.4: https://github.com/facebookresearch/xformers/actions/runs/10559887009

xFormers can now be fully added to a fresh Forge install.

sashaok123 avatar Aug 28 '24 04:08 sashaok123

(Viruses on board!)

How can I ban a user or send a complaint to the administrators?

sashaok123 avatar Aug 28 '24 05:08 sashaok123

How can I ban a user or send a complaint to the administrators?

You can report inauthentic account activity to GitHub Support by visiting the account's home page, where there is a "Block or report" option under the account avatar.

EDIT: By the way, this account has since been deleted.

wuliaodexiaoluo avatar Aug 28 '24 06:08 wuliaodexiaoluo

https://app.any.run/tasks/abb4419a-a8cb-4707-946d-e73a9d3561bb The usual Lumma stealer... I don't know if you get notifications for every message in an issue, so: @lllyasviel bad files x.x

sais-github avatar Aug 28 '24 07:08 sais-github

For those who don't know how to install xFormers with CUDA 12.4, PyTorch 2.4, and Python 3.10, read this.

Link to download the .whl file: https://github.com/facebookresearch/xformers/actions/runs/10559887009

(screenshots omitted)
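For anyone following the screenshots, here is a minimal command-line sketch of those steps (assuming a standard Forge checkout with its own venv folder, and that the cp310/win_amd64 wheel has already been downloaded from the Actions run above; the exact filename below is only an example):

```bat
:: Run from the Forge root folder; paths and the wheel filename are illustrative.

:: 1) Make sure torch/torchvision in the venv match the cu124 build the wheel targets
venv\Scripts\python.exe -m pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124

:: 2) Install the downloaded xFormers wheel into the same venv
venv\Scripts\python.exe -m pip install C:\Downloads\xformers-0.0.28.dev893+cu124-cp310-cp310-win_amd64.whl

:: 3) Sanity check: print the xFormers build info
venv\Scripts\python.exe -m xformers.info
```

After that, launching with --xformers in COMMANDLINE_ARGS should report the xformers version in the UI footer.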

dongxiat avatar Aug 28 '24 12:08 dongxiat

Is there a noticeable enough performance boost to want to reinstall with newer pytorch?

sais-github avatar Aug 28 '24 12:08 sais-github

Is there a noticeable enough performance boost to want to reinstall with newer pytorch?

With the RTX 3090 I'm seeing a bit of a boost; when it takes 1.5 seconds per step, every little bit helps. But it's also because of great developments in Forge. Tested with Flux GGUF Q8, 20 steps, Euler simple, at 1024x1024, after 3-4 warm-up runs for the fastest possible time per image:

- 2f0555f: Queue/Shared was 31.7 sec; Async was 33.7 sec (didn't work well at the time)
- d339600 with --disable-xformers: Queue/Shared 28.7 sec; Async/Shared 28.5 sec
- d339600 with xformers: Queue/Shared 26.3 sec; Async/Shared actually isn't any faster now at 26.6-26.8 sec

HMRMike avatar Aug 28 '24 20:08 HMRMike

Is there a noticeable enough performance boost to want to reinstall with newer pytorch?

Yes! Did some benchmarks on my RTX 3070 with Flux Q8, 28 steps, Euler, simple, 1024x1024:

- Forge with CUDA 12.1 + PyTorch 2.3.1: 3.61 s/it
- Forge with CUDA 12.4 + PyTorch 2.4: 3.05 s/it (15% faster)
- Forge with CUDA 12.4 + PyTorch 2.4 + xformers: 2.85 s/it (21% faster)

adrianschubek avatar Aug 28 '24 20:08 adrianschubek

wow

yamfun avatar Aug 30 '24 15:08 yamfun

Is there a noticeable enough performance boost to want to reinstall with newer pytorch?

With RTX 3090 I'm seeing bit of a boost, when it takes 1.5 seconds per step every little bit helps. But it's also because of great developments in Forge. Tested with Flux GGUF Q8, 20 steps Euler simple at 1024X1024 after 3-4 warm up runs for fastest possible time per image. 2f0555f: Queue/shared was 31.7sec. Async was 33.7 (didn't work well at the time) d339600 --disable-xformers: Queue/shared:28.7sec. Async/Shared: 28.5sec. d339600+xformers: Queue/shared 26.3sec. Async/shared actually isn't any faster now at 26.6-26.8sec.

Can you share your command line args? I'm getting around 2.2 s/it with the same config,

using `COMMANDLINE_ARGS= --xformers --skip-torch-cuda-test --cuda-stream`.
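(For reference, those flags normally live in webui-user.bat; a minimal sketch, assuming the stock Forge launcher layout:)

```bat
@echo off

set PYTHON=
set GIT=
set VENV_DIR=
set COMMANDLINE_ARGS=--xformers --skip-torch-cuda-test --cuda-stream

call webui.bat
```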

l33tx0 avatar Sep 02 '24 17:09 l33tx0

Is there a noticeable enough performance boost to want to reinstall with newer pytorch?

With RTX 3090 I'm seeing bit of a boost, when it takes 1.5 seconds per step every little bit helps. But it's also because of great developments in Forge. Tested with Flux GGUF Q8, 20 steps Euler simple at 1024X1024 after 3-4 warm up runs for fastest possible time per image. 2f0555f: Queue/shared was 31.7sec. Async was 33.7 (didn't work well at the time) d339600 --disable-xformers: Queue/shared:28.7sec. Async/Shared: 28.5sec. d339600+xformers: Queue/shared 26.3sec. Async/shared actually isn't any faster now at 26.6-26.8sec.

can you share your command line args , i'm getting around 2.2 s/it with same config

using COMMANDLINE_ARGS= --xformers --skip-torch-cuda-test --cuda-stream

2.2 definitely seems a bit on the slow side for Q8. I use pretty much the same args usually. Out of curiosity I removed them all, so only --xformers remains, and the speed was not impacted at all! Maybe it's just because of the simple generation settings?

Retest on my current updated commit (stuff changes in 5 days): 1024x1024, Euler Simple, 30 steps, Queue/Shared swap. Model: flux1-dev-Q8_0, Module 1: t5-v1_1-xxl-encoder-Q8_0, Module 2: clip_l, Module 3: ae. Console reports 1.3 s/it, and after "settling" for 2-3 runs, the fastest time per image was reported at 39.6 sec.

Versions from UI bottom: version: f2.0.1v1.10.1-previous-495-g4f64f6da • python: 3.10.6 • torch: 2.4.0+cu124 • xformers: 0.0.28.dev893+cu124 • gradio: 4.40.0 • checkpoint: d9b5d2777c

HMRMike avatar Sep 02 '24 19:09 HMRMike

Is there a noticeable enough performance boost to want to reinstall with newer pytorch?

With RTX 3090 I'm seeing bit of a boost, when it takes 1.5 seconds per step every little bit helps. But it's also because of great developments in Forge. Tested with Flux GGUF Q8, 20 steps Euler simple at 1024X1024 after 3-4 warm up runs for fastest possible time per image. 2f0555f: Queue/shared was 31.7sec. Async was 33.7 (didn't work well at the time) d339600 --disable-xformers: Queue/shared:28.7sec. Async/Shared: 28.5sec. d339600+xformers: Queue/shared 26.3sec. Async/shared actually isn't any faster now at 26.6-26.8sec.

can you share your command line args , i'm getting around 2.2 s/it with same config using COMMANDLINE_ARGS= --xformers --skip-torch-cuda-test --cuda-stream

2.2 definitely seems a bit on the slow side for Q8. I use pretty much the same args usually. Out of curiosity I removed them all, only --xformers remains. The speed was not impacted at all! Maybe it's just because of the simple generation settings? To retest in my current updated commit (stuff changes in 5 days): 1024x1024, Euler Simple, 30 steps, Queue/Shared swap. Model: flux1-dev-Q8_0, Module 1: t5-v1_1-xxl-encoder-Q8_0, Module 2: clip_l, Module 3: ae Console reports 1.3s/it, and after "settling" for 2-3 runs, the fastest time per image was reported at 39.6sec.

Versions from UI bottom: version: f2.0.1v1.10.1-previous-495-g4f64f6da • python: 3.10.6 • torch: 2.4.0+cu124 • xformers: 0.0.28.dev893+cu124 • gradio: 4.40.0 • checkpoint: d9b5d2777c

When I start, I get an error like this:

pytorch version: 2.4.0+cu124
WARNING:xformers:A matching Triton is not available, some optimizations will not be enabled
Traceback (most recent call last):
  File "..forge\venv\lib\site-packages\xformers\__init__.py", line 57, in _is_triton_available
    import triton # noqa
ModuleNotFoundError: No module named 'triton'
xformers version: 0.0.28.dev893+cu124
Set vram state to: NORMAL_VRAM
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: True
Using xformers cross attention
Using xformers attention for VAE

With SDXL I'm getting 3.57 it/s:

night rain ancient era
Steps: 20, Sampler: DPM++ 2M SDE, Schedule type: Karras, CFG scale: 7.5, Seed: 3617511334, Size: 1024x1024, Model hash: 7b91764cf2, Model: copaxTimelessxlSDXL1_v122, Version: f2.0.1v1.10.1-previous-501-g668e87f9, Module 1: sdxl_vae_fp16_fixv2, Source Identifier: Stable Diffusion web UI

Could you confirm whether the issue is only with Flux, or with my installation?

l33tx0 avatar Sep 02 '24 22:09 l33tx0

Is there a noticeable enough performance boost to want to reinstall with newer pytorch?

With RTX 3090 I'm seeing bit of a boost, when it takes 1.5 seconds per step every little bit helps. But it's also because of great developments in Forge. Tested with Flux GGUF Q8, 20 steps Euler simple at 1024X1024 after 3-4 warm up runs for fastest possible time per image. 2f0555f: Queue/shared was 31.7sec. Async was 33.7 (didn't work well at the time) d339600 --disable-xformers: Queue/shared:28.7sec. Async/Shared: 28.5sec. d339600+xformers: Queue/shared 26.3sec. Async/shared actually isn't any faster now at 26.6-26.8sec.

can you share your command line args , i'm getting around 2.2 s/it with same config using COMMANDLINE_ARGS= --xformers --skip-torch-cuda-test --cuda-stream

2.2 definitely seems a bit on the slow side for Q8. I use pretty much the same args usually. Out of curiosity I removed them all, only --xformers remains. The speed was not impacted at all! Maybe it's just because of the simple generation settings? To retest in my current updated commit (stuff changes in 5 days): 1024x1024, Euler Simple, 30 steps, Queue/Shared swap. Model: flux1-dev-Q8_0, Module 1: t5-v1_1-xxl-encoder-Q8_0, Module 2: clip_l, Module 3: ae Console reports 1.3s/it, and after "settling" for 2-3 runs, the fastest time per image was reported at 39.6sec. Versions from UI bottom: version: f2.0.1v1.10.1-previous-495-g4f64f6da • python: 3.10.6 • torch: 2.4.0+cu124 • xformers: 0.0.28.dev893+cu124 • gradio: 4.40.0 • checkpoint: d9b5d2777c

when i start i got an error like this pytorch version: 2.4.0+cu124 WARNING:xformers:A matching Triton is not available, some optimizations will not be enabled Traceback (most recent call last): File "..forge\venv\lib\site-packages\xformers\__init__.py", line 57, in _is_triton_available import triton # noqa ModuleNotFoundError: No module named 'triton' xformers version: 0.0.28.dev893+cu124 Set vram state to: NORMAL_VRAM VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16 CUDA Using Stream: True Using xformers cross attention Using xformers attention for VAE

with sdxl i'm getting 3.57it/s night rain ancient era Steps: 20, Sampler: DPM++ 2M SDE, Schedule type: Karras, CFG scale: 7.5, Seed: 3617511334, Size: 1024x1024, Model hash: 7b91764cf2, Model: copaxTimelessxlSDXL1_v122, Version: f2.0.1v1.10.1-previous-501-g668e87f9, Module 1: sdxl_vae_fp16_fixv2, Source Identifier: Stable Diffusion web UI

if you can confirm the issue is only with flux or my installation

Yeah the Triton thing is only for Linux, apparently. It's not a real issue on Windows, you can ignore this message. https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/7115

I'm getting almost exactly the same speed with SDXL (it fluctuates, up to 3.6, but effectively identical to yours) and these settings, so it leaves something weird with Flux. Just to make sure, I'm using the Q8 model from here: https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main Otherwise all the versions seem identical, we get the same startup stuff. Even without xformers it should be quite a bit faster. Just as a sanity check in such cases I like to just git clone a fresh copy and see if there are any differences, maybe erase the VENV folder and let stuff rebuild if the fresh copy was indeed faster. Makes hunting for a specific issue less frustrating.
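(As a concrete version of that sanity check, something along these lines, assuming the upstream Forge repo URL and a throwaway folder name:)

```bat
:: Clone a fresh copy next to the existing install and let it build its own venv on first launch
git clone https://github.com/lllyasviel/stable-diffusion-webui-forge.git forge-fresh
cd forge-fresh

:: Reapply the same COMMANDLINE_ARGS, then launch and compare s/it against the old install
webui-user.bat
```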

HMRMike avatar Sep 03 '24 00:09 HMRMike

For those who don't know how to install xFormers with CUDA 12.4, PyTorch 2.4, and Python 3.10, read this.

Link to download the .whl file: https://github.com/facebookresearch/xformers/actions/runs/10559887009

Can you explain how to install the wheel in Forge without a venv (the cuda12.4 / pytorch2.4 .zip on main page)? I know it uses embedded python and sets the paths via environment.bat, but I still can't get pip to work.

EDIT: I think I figured it out, it's the same with ComfyUI's embedded python.

The embedded python.exe is in system\python\python.exe; then you just add -m pip install after the .exe.
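So for the no-venv package the install boils down to something like this (a sketch; the wheel filename is just an example from the builds discussed above):

```bat
:: Run from the root of the extracted one-file Forge package (the folder containing system\python)
system\python\python.exe -m pip install C:\Downloads\xformers-0.0.28.dev893+cu124-cp310-cp310-win_amd64.whl

:: Optional: confirm pip sees it
system\python\python.exe -m pip show xformers
```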

You can laugh at me now.

Hujikuio avatar Nov 13 '24 21:11 Hujikuio

Hey guys, it looks like the files are down. I’ve been stuck on this for days and only just found this thread, but sadly the links are now expired.

Does anyone have the prebuilt xformers or a safe reupload?

Thanks.

Edit: Here's a more detailed post about my problem. Edit 2: Check the link above to find the solutions.

Shippdenn avatar May 25 '25 12:05 Shippdenn