
[Feature Request]: Support Torch 2.0 now that it is GA.

Open aifartist opened this issue 2 years ago • 31 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What would your feature do?

Either switch A1111 to use torch 2.0.0+cu118 by default, or fix one problem I found while trying to use it. Nothing prevents torch 2.0 from being used if someone does a manual install:

pip3 install clean-fid numba numpy torch==2.0.0+cu118 torchvision --force-reinstall --extra-index-url https://download.pytorch.org/whl/cu118

However, if A1111 then does an xformers install, torch gets downgraded again. To allow torch 2.0 to be used properly, either find an xformers prebuilt for torch 2.0 or ignore --xformers with a warning when the user has torch 2.0 installed. The warning can suggest trying sdp instead.

Proposed workflow

If torch 2.0.0 is installed, --xformers triggers the xformers install, and no matching xformers prebuilt can be found for 2.0.0+cu117 or 2.0.0+cu118 (depending on which is installed), then I suggest a warning like: Could not find xformers to match the installed torch version <PRINT ACTUAL __version__>. Ignoring the attempt to install xformers. You can use --opt-sdp-attention instead or build your own xformers.
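Roughly, the check could look like this (just a sketch, not actual launch.py code; the function name and hook point are made up):

import torch
from packaging import version

def maybe_install_xformers(xformers_requested: bool):
    # Strip the local build tag ("+cu118") before comparing versions.
    torch_ver = version.parse(torch.__version__.split("+")[0])
    if xformers_requested and torch_ver >= version.parse("2.0.0"):
        print(f"Could not find xformers to match the installed torch version "
              f"{torch.__version__}. Ignoring the attempt to install xformers. "
              f"You can use --opt-sdp-attention instead or build your own xformers.")
        return
    # ...otherwise proceed with the normal xformers install...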

Additional information

No response

aifartist avatar Mar 17 '23 06:03 aifartist

can also use venv to have versions of both and select whichever fits you

hananbeer avatar Mar 17 '23 13:03 hananbeer

can also use venv to have versions of both and select whichever fits you

Damn! Now I feel stupid. I keep reinstalling back and forth to do experiments or just check things. What an idiot I am. Really. Duh! I've probably wasted hours cumulatively doing this. I should be fired! Luckily I'm retired. :-)

python3 -m venv `pwd`/env-pt2.0
python3 -m venv `pwd`/env-pt1.13

or something like that. I could have dozens of combinations.

However, I have thought of creating some fixed venvs outside the home directories of the many different SD apps (and versions of those apps) I experiment in with their source code. They take up a lot of disk space.

aifartist avatar Mar 17 '23 21:03 aifartist

You can specify the venv in webui-user.bat
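There's a VENV_DIR variable in there for this, e.g. set VENV_DIR=C:\path\to\env-pt2.0 (that path is just an example).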

missionfloyd avatar Mar 17 '23 21:03 missionfloyd

Until there is more clarity on this from someone who understands how WebUI works, I wouldn't advise anybody to use PyTorch 2.0. It is not supported yet.

The package xformers is now effectively part of PyTorch 2.0 (its memory-efficient attention has been upstreamed), which means that, to install PyTorch 2.0, xformers should be uninstalled and the argument --xformers should not be used. But then, what ends up happening is that xformers simply will not be used! Using PyTorch 2.0 without xformers is likely less efficient than using PyTorch 1.13.1 with xformers.
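(For anyone wanting to poke at it: the built-in replacement is exposed as torch.nn.functional.scaled_dot_product_attention. A minimal illustration, assuming a CUDA machine; the shapes are arbitrary toy values:)

import torch
import torch.nn.functional as F

# Toy attention inputs: (batch, heads, sequence_length, head_dim)
q = torch.randn(1, 8, 64, 40, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 64, 40, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 64, 40, device="cuda", dtype=torch.float16)
# PyTorch 2.0 dispatches to a fused flash/memory-efficient kernel when it can
out = F.scaled_dot_product_attention(q, k, v)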

What's worse, it's likely that WebUI is not yet taking any advantage of PyTorch 2.0 at all! (It lacks torch.compile(...), for example.) So you just end up with (potentially buggy) PyTorch 1.13.1-level performance.

TernaryM01 avatar Mar 18 '23 09:03 TernaryM01

24.) Automatic1111 Web UI - PC - Free. To downgrade to an older version if you don't like Torch 2: first delete venv, let it reinstall, then activate venv and run this command: pip install -r "path_of_SD_Extension\requirements.txt" (from the guide "How To Install New DREAMBOOTH & Torch 2 On Automatic1111 Web UI PC For Epic Performance Gains").

This method installs the latest cuda dll files too.


Test with this Python snippet:

import torch
print(f"torch {torch.__version__}, cuda {torch.version.cuda}, cudnn {torch.backends.cudnn.version()}")

FurkanGozukara avatar Mar 19 '23 10:03 FurkanGozukara

@FurkanGozukara After a ton of hours testing locally and with someone on a 3060ti, I can say Pytorch 2 is slower for us. I don't mean by a little, either. They dropped 2-3 it/s and I went from 1:26s to 1:30s, all else being equal. They and I reverted back to Pytorch 1, scratching our heads as to why, since the claim to fame of PyTorch 2 is supposed to be a speed increase. They said it even used more vram.

It is still a wild world for this newish AI stuff. With VoltaML, which uses torch=1.12+cu113, I get 90 it/s, but even switching to 1.13.1+cu117 drops the perf by half (45 it/s). I told them and they confirmed this, but Volta has no need to upgrade torch given that TensorRT is so very fast. I have spent some time starting to debug this but have been distracted by other things. Today I figured out why my i9-13900K was only doing 5.5 GHz instead of 5.8 GHz. With Torch 2.0 compile I can now get:

100%|███████████| 20/20 [00:00<00:00, 51.12it/s]
100%|███████████| 20/20 [00:00<00:00, 51.15it/s]
100%|███████████| 20/20 [00:00<00:00, 51.19it/s]
100%|███████████| 20/20 [00:00<00:00, 51.31it/s]
100%|███████████| 20/20 [00:00<00:00, 51.28it/s]

Not as fast as Volta but to get 51+ in A1111 is exciting.

aifartist avatar Mar 19 '23 23:03 aifartist

@FurkanGozukara After a ton of hours testing locally and with someone on a 3060ti, I can say Pytorch 2 is slower for us. (...) Not as fast as Volta but to get 51+ in A1111 is exciting.

what gpu are you using?

edwardyeung avatar Mar 20 '23 03:03 edwardyeung

@FurkanGozukara After a ton of hours testing locally and with someone on a 3060ti, I can say Pytorch 2 is slower for us. (...) They said it even used more vram.

None of that is surprising, and it's exactly as I predicted. The reason is simply that the code in Automatic1111's WebUI and StabilityAI's Stable Diffusion repos needs to be modified to take advantage of PyTorch 2. With the code unmodified, it's going to be worse than PyTorch 1.13.1 because, among other things, xformers is not used. (PyTorch 2 now comes with xformers-style attention built in, but again, the Python code needs to be modified to enable it.)

I imagine the modification should be relatively minor (something like simply wrapping one call in torch.compile(...) with some appropriate parameters), as sketched below. So, unless you know how to do it (and please send the pull request, and thank you very much), you'd better be a little more patient and wait until it is implemented.
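Something in this spirit, perhaps (an untested sketch; the attribute path is my guess from the CompVis ldm layout, not verified against WebUI's code):

import torch

def compile_sd_model(sd_model):
    # torch.compile wraps the module in an optimized callable; the first
    # forward pass triggers the (slow) compilation, later passes run faster.
    sd_model.model.diffusion_model = torch.compile(
        sd_model.model.diffusion_model, mode="reduce-overhead"
    )
    return sd_model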

TernaryM01 avatar Mar 20 '23 11:03 TernaryM01

@aifartist

With Torch 2.0 compile I can now get: (...) Not as fast as Volta but to get 51+ in A1111 is exciting.

Do you mean you modified Automatic1111's code to use torch.compile(...)? Would you send a pull request?

TernaryM01 avatar Mar 20 '23 11:03 TernaryM01

@FurkanGozukara After a ton of hours testing locally and with someone on a 3060ti, I can say Pytorch 2 is slower for us. (...) Not as fast as Volta but to get 51+ in A1111 is exciting.

what gpu are you using?

  1. When it runs near 100% busy and efficiently (xformers or sdp) + torch.compile, you need an i9-13900K level of processor to get the most from it with single image generations (batchsize=1).

aifartist avatar Mar 20 '23 16:03 aifartist

I don't get it. How are people getting these MASSIVE speeds anyway? And is the CPU really a bottleneck for this stuff as well? I know so little! I have a 3090 and a 2950X from AMD and I am getting (with 512x768, 30 steps, DPM++ SDE Karras) about 3-6 it/s !!!!!

oliverban avatar Mar 21 '23 12:03 oliverban

I don't get it. How are people getting these MASSIVE speeds anyway? And is the CPU really a bottleneck for this stuff as well? I know so little! I have a 3090 and a 2950X from AMD and I am getting (with 512x768, 30 steps, DPM++ SDE Karras) about 3-6 it/s !!!!!

Number one: Are you running cuDNN v8.7 or higher? Number two: Are you using xformers or sdp? Even though your 4.4 GHz CPU isn't that fast, for 512x768 with a 3090, you should be able to do much better. I hope you are quoting batchsize=1 generations.
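You can check what your venv's torch actually loaded with two lines of Python:

import torch
print(torch.backends.cudnn.version())   # e.g. 8700 means cuDNN 8.7.0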

aifartist avatar Mar 21 '23 19:03 aifartist

Number one: Are you running cuDNN v8.7 or higher? Number two: Are you using xformers or sdp? Even though your 4.4 GHz CPU isn't that fast, for 512x768 with a 3090, you should be able to do much better. I hope you are quoting batchsize=1 generations.

  1. Yes, I am. Or I think I am. I did copy the new cuDNN 8.7+ DLLs into the "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin" folder. Is that the same as the cuda compiler? When doing nvcc --version I get: Cuda compilation tools, release 11.8, V11.8.89 Build cuda_11.8.r11.8/compiler.31833905_0

Any way of explicitly checking cuDNN?

  2. Xformers only. SDP I have seen around but haven't tried it yet.

Yes a batch size of 1.

A friend on their 3060ti gets 10.15 it/s consistently for SD 1.5 512x512 gens, with occasional dips into the 9.x it/s range.

Edited for the card they actually use.

Yeah, when doing gens with default 1.5, no LoRAs or TI, just Euler A 512x512, I get around 15 it/s, which is why I specified my options; it's those that make the quality what it is. Euler A and the rest give sub-par results, which is why I (almost) never use them.

oliverban avatar Mar 22 '23 12:03 oliverban

Oh, I forgot to mention DDIM, and 20 steps for them. Your it/s sound way too low and I highly suspect something is wrecked somewhere when a 3060ti is outdoing you around 3 to 1.

Well, on DDIM, 20 steps and 512x512 on the default SD 1.5, with a fairly OK positive prompt and 3 negative prompts, I am getting around 11.12 it/s as an average of 7 runs, which still sounds low. I don't know what could be wrong though.

EDIT: Lol, just saw down at the bottom that my version says "torch: 1.13.1+cu117" so that would seem it's using the old cu and not the one I see when doing "nvcc --version" in CMD? WHY?!?! :O

Found this archive: is downloading 8.8 gonna do anything? https://developer.nvidia.com/rdp/cudnn-archive

oliverban avatar Mar 22 '23 14:03 oliverban

Number one: Are you running cuDNN v8.7 or higher? Number two: Are you using xformers or sdp? (...)

  1. Yes, I am. Or I think I am. I did copy the new cuDNN 8.7+ DLLs into the "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin" folder. Is that the same as the cuda compiler? (...)

Yes, that is your system-level install of the cuda toolkit. Given what I believe is called the RPATH search path, shouldn't you have instead replaced the copies in <A1DIR>\venv\python3.10\site-packages\torch\{lib or dll???}? I'm not a Windows guy, so I don't know whether it uses a lib, dll or bin directory under torch. But just search under site-packages to see if there is a set of libcudnn* libraries there and replace those.
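If you want to find where torch actually lives without guessing, something like this should do it (untested sketch):

import os, torch
print(torch.backends.cudnn.version())                        # the cuDNN build torch actually loaded
print(os.path.join(os.path.dirname(torch.__file__), "lib"))  # where torch keeps its bundled libraries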

aifartist avatar Mar 22 '23 16:03 aifartist

I'm seeing an interesting issue. I'm on Win11 Pro, with an i9-13900K and a 4090. When I do a batch size of 1, I get somewhere between 18-25 it/s for a single image. When I increase that to a batch size of 2, it goes down to about 8 it/s, and if I increase to a batch of 8 it's 2-3 it/s. But if I do a batch count of 8 with a batch size of 1, I get the 18-25 it/s range. It's a little unclear if I should be seeing closer to 50 it/s (I would really love clarity on that).

[screenshot of generation output with per-run it/s]

dustyatx avatar Mar 23 '23 01:03 dustyatx

@dustyatx I personally hate the it/s crap. Yes, it is just another metric, like any other, which varies based on several factors.

At one point I see 89 steps done in 4 seconds. That is an OK number, indicating you are probably using cuDNN >= v8.7. While an OK number, you should be twice as fast with your hardware. With batchsize=2 you need to multiply by 2 to get the "effective" it/s, which would be 16. That number doesn't sound right, because with cuDNN 8.7 the optimal batchsize for throughput is about 2 or 3, so the effective number should have been better than your batchsize=1 run. In other words, greater than 20, not less than 20.

HOWEVER, I don't see an 8 it/s in your output. I see a 19 which, if that is your batchsize=2 example, is effectively 38. With the optimal batchsize, the "Windows" perf penalty isn't as pronounced.

Batchsize=8 with cuDNN >= v8.7 is no longer optimal, as it may have been with cuDNN 8.5. 2 it/s seems too low, but since it is SUB-optimal there's no reason to figure out why, because you shouldn't use that.

I'm not sure why your number of steps are all over the place. If you are trying to do a controlled test varying the batchsize leave the number of steps alone.

aifartist avatar Mar 23 '23 02:03 aifartist

Yes, that is your system-level install of the cuda toolkit. (...) But just search under site-packages to see if there is a set of libcudnn* libraries there and replace those.

Well, that doesn't make sense then. It should say cu118, right? For 11.8? I've never installed 11.7. I've downloaded the new cuDNN 8.8 and replaced the DLLs in BOTH paths, the one I wrote and the one you wrote. Indeed, I had already replaced them in /venv/ as well, with the older 8.7. Still seeing the same it/s, 11-ish.

Is there any way I can reinstall or update the environment from that folder on the command line?

oliverban avatar Mar 23 '23 09:03 oliverban

@aifartist thank you for all of that great information, very informative. I'm a scaled systems & solutions designer/architect (includes ML/AI, etc), so my hands-on knowledge tends to be shallow in many places.

I personal hate the it/s crap. Yes, it is just another metric like any other which varies based on several factors.

I see what you're saying now that I have a better understanding. I assumed that a batch was being done in serial, not parallel, but now that I think about it that way it makes total sense. I can see how this output would confuse a lot of people; maybe it would be better to have a different one that handles the mental math you have to do to read it.

While an OK number you should be twice as fast with your hardware.

Would love any guidance on what to do to make it faster. I don't normally use Windows but I switched over after I realized this approach didn't support Linux (missing whl builds?). Huge pain to remove the massive amount of Windows bloat, but NTLite did a great job as it always has.

Ultimately I have two projects I'm looking to accomplish that I think will be time intensive, and I want to make sure I'm properly optimized before I start. I'm working on restoring a bunch of old family videos; I ran a nice test in Colab and that seems to work well. The other is working with Riffusion (SD for audio); I have hundreds of thousands of spectrograms that I want to train on (still trying to figure out how to do training).

I'm not sure why your number of steps are all over the place. If you are trying to do a controlled test varying the batchsize leave the number of steps alone.

I wasn't doing a systematic test, just some feeling-around spot testing. The screenshot was only a small section of what I had been doing over about 30 mins. Happy to run a systematic one if you can help me understand how I should go about testing.

At one point I see 89 steps done in 4 seconds. That is an ok number, indicating you are probably using cuDNN >= v8.7.

I didn't specifically install cuDNN, and I'm not sure at what point in the installation it gets installed, but here is some version info: Driver Version: 531.29, CUDA Version: 12.1, torch 2.0.0+cu118, cuda 11.8, cudnn 8700.

Batchsize=8 with cuDNN >= v8.7 is no longer the optimal as may have been the case with cuDNN 8.5.

This is a very interesting statement; there is obviously some nuance here that I am missing. I had the naive impression that batch size was directly correlated with VRAM size, so 24GB would give me the largest batch size at 512x512, and that might have to go down as resolution went up. For future testing, is there a good way to identify the batch size sweet spot?

dustyatx avatar Mar 23 '23 13:03 dustyatx

@dustyatx What doesn't work on Linux? I'm on Ubuntu, run SD, get great performance, and all torch versions work for me.

I assumed that a batch was being done in serial not parallel, but now that I think about it that way it makes total sense. I can see how this output would confuse a lot of people, maybe it would be better to have a different one that handles the mental math you have to do to read it.

batchcount=8 batchsize=1 and batchcount=1 batchsize=8 both produce 8 images. The first is serial and the second is parallel, and the second way needs to have its it/s adjusted to make perf comparisons, as the arithmetic below shows.
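In other words (toy numbers, just to show the arithmetic):

reported_its = 19.0   # what the progress bar shows for a batchsize=2 run
batch_size = 2
effective_its = reported_its * batch_size   # images' worth of steps per second
print(effective_its)                        # 38.0, comparable to a batchsize=1 it/s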

Batchsize=8 with cuDNN >= v8.7 is no longer the optimal as may have been the case with cuDNN 8.5.

This is a very interesting statement, there is obviously some nuance here that I am missing. I had the naive impression that batch size was directly correlated to RAM size, so 24GB would give me the largest batch size at 512x512 and that might have to go down as resolution went up. For future testing, is there a good way to identify the batch size sweet spot?

The only relation between batchsize and VRAM size is the size of the largest batch you can use. How many images in a batch can fit in memory at the same time is kind of obvious. I have run batchsize=100 once for an experiment and it worked on my 4090. Is there some other relation you meant? Having said that, I don't mean you should actually use the largest batch size that can fit. Aggregate performance in terms of num_images/time gets worse once you pass the optimal batchsize.

With cuDNN 8.5 I only got 13.5 it/s using batchsize=1, but with batchsize 15 or 16 I got the best aggregate perf, of 40(?) or so. It has been a while. HOWEVER, with cuDNN 8.7 I got 39.5 with serial single-image gens. Because the GPU was now at closer to 100% utilization, the optimal batchsize became only 2 or 3, where I got smaller gains. I was so happy with 39.5 serial that I didn't even bother to do the trivial task of testing exactly which was better and by how much. Besides, I have since gotten serial up above 41 or 43 with other work I've done.

NOTE: For those with slower CPUs, that only get 20 to 30 it/s serial, the optimal batchsize is probably higher, but that is your task to find out and not mine. I'm tired of people saying they have a CPU half the speed of mine and wondering why they get half the performance. The CPU needs to rapidly deliver work to the GPU to keep it busy. For longer-running operations, like a large batch or a large image, that becomes less of an issue. But for 512x512 it matters a lot.

For future testing, is there a good way to identify the batch size sweet spot?

The test can be simple. Set steps=50, batchsize=4, use 512x512 and any prompt with about 10 words (certainly not 75+). I use euler_a; other things don't really matter. With batchsize 4, ignore the it/s on the first one and average the remaining 3 (see the toy averaging example below). Ignore the total line at the end. The sd2.1 model gets about 2 it/s more than any other model I've seen, like sd1.5. NOTE: On Windows, performance is far less stable than I see on Linux, so you might use a larger batchsize for averaging the results. This is what you should use for reporting purposes so we can see if you have things set up correctly. After that, you need to test with your favorite image size, and perhaps img2img if that is your workload.
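To be concrete about the averaging (these numbers are made up):

runs_its = [33.0, 39.4, 39.6, 39.5]           # per-run it/s; the first run is warm-up
print(sum(runs_its[1:]) / len(runs_its[1:]))  # ignore the first, average the remaining 3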


Are you getting 8 it/s or 19 it/s for batchsize=2? Did you copy the cuDNN 8.7 libraries into your cuda 11.8 system directory, or into your venv/.../torch/lib directory, which will be seen FIRST by the RPATH library search? If the v8.5 libraries are still there from when torch was installed, they will be used instead of the others.

aifartist avatar Mar 23 '23 18:03 aifartist

@aifartist I ended up throwing in the towel. In 25+ years of computing, I've never had a software installation experience as painful as getting all the various CUDA-related software working properly.

Right now I have a working version of Automatic1111 with xformers, and I'm just going to have to leave it at that.

I tried every major branch of Linux, multiple different OS versions, every installation process I could find. It's shocking how bad this UX is. No idea why Nvidia doesn't see this as the massive problem that it truly is.

dustyny avatar Apr 02 '23 16:04 dustyny

Are you getting 8 it/s or 19 it/s for batchsize=2?

I shall repeat one of the questions: Are you getting 8 it/s or 19 it/s for batchsize=2? However, I hate trying to solve problem number two when problem number one needs fixing. Keep it simple, stay focused on one problem at a time (there can be exceptions to that), and don't flood me with output showing a snippet of a large number of test permutations where I don't know what is what. That doesn't help; it just confuses me, unless I spend way too much time on one of the many problems way too many people are having.

If you have indeed tried Linux, and I don't mean some pile of crap WSL, it takes me all of 42 seconds to do:

git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui a1111
cd a1111
python3 -m venv `pwd`/venv
source venv/bin/activate
./webui.sh --opt-sdp-attention --opt-channels-last

and several more seconds to upgrade to torch==2.0.0 and easily get 39.5 it/s on my i9-13900K + 4090
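(The upgrade itself is just the pip line from the top of this issue, run inside the activated venv: pip3 install torch==2.0.0+cu118 torchvision --force-reinstall --extra-index-url https://download.pytorch.org/whl/cu118)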

aifartist avatar Apr 02 '23 18:04 aifartist

Torch 2.0 makes no diff in my case. 4070 Ti, 7 it/s before and after. 512, 40 steps, fp16, no xformers.

VictorZakharov avatar Apr 09 '23 20:04 VictorZakharov

@FurkanGozukara After a ton of hours testing locally and with someone on a 3060ti, I can say Pytorch 2 is slower for us. (...) They said it even used more vram.

Absolutely correct; tested using vladmandic/automatic and it's slower and uses more VRAM. Can anyone explain the delusional graphs in the article done by the pytorch devs? Especially the 4090 charts? I cannot replicate ANY of those findings. What UI were they using? https://pytorch.org/blog/accelerated-diffusers-pt-20/

razvan-nicolae avatar Apr 14 '23 23:04 razvan-nicolae

Torch 2.0 has been the default for quite some time now. Closing. https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/9191

catboxanon avatar Aug 07 '23 15:08 catboxanon