
Upgrading torch, torchvision, xformers (Windows) to use cu117

Open · petalas opened this issue 3 years ago · 56 comments

🤔   What does this PR do?

  • Upgrades torch to 1.13.1+cu117
  • Upgrades torchvision to 0.14.1+cu117
  • Upgrades xformers to 0.0.16rc404 (Windows); equivalent pip commands below.
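For reference, roughly the equivalent manual upgrade inside the webui venv (a sketch; the cu117 extra index URL is the standard PyTorch one):

```
:: run inside the activated venv (stable-diffusion-webui\venv)
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install xformers==0.0.16rc404
```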

🗒️   Notes:

The triton error is normal on Windows and you can just ignore it.

You can get an additional performance boost by copying the .dll files from cudnn-windows-x86_64-8.6.0.163_cuda11-archive\bin into stable-diffusion-webui\venv\Lib\site-packages\torch\lib (replacing the old ones), or by building torch manually with CUDA 11.8.
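Something like this, assuming the cuDNN archive was extracted next to the webui folder (paths are illustrative):

```
:: overwrite torch's bundled cuDNN DLLs with the 8.6.0 ones
copy /Y cudnn-windows-x86_64-8.6.0.163_cuda11-archive\bin\*.dll stable-diffusion-webui\venv\Lib\site-packages\torch\lib\

:: verify from the venv afterwards; 8.6.0 should print as 8600
python -c "import torch; print(torch.backends.cudnn.version())"
```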

Hopefully we can get official torch binaries built with cuda 11.8 soon.

closes https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/5901

petalas avatar Dec 22 '22 22:12 petalas

Use half precision, it boosts your performance by around 50%

Nyaster avatar Dec 22 '22 23:12 Nyaster

Use half precision, it boosts your performance by around 50%

I have tried removing the --no-half and --precision full args; it barely makes a difference in my case, ~16.6 vs ~17 it/s. If full precision helps accuracy I'd rather keep it on (can't tell if it makes any difference tbh).

Unless you're talking about something else that I'm missing? Definitely nowhere near 50% from these two.
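(For reference, these are the launcher args in question; a sketch assuming the stock webui-user.bat:)

```
:: webui-user.bat, full precision (what I was running):
set COMMANDLINE_ARGS=--xformers --no-half --precision full

:: vs. the default half precision, just:
set COMMANDLINE_ARGS=--xformers
```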

petalas avatar Dec 23 '22 00:12 petalas

The error is purely because you used an xformers build for another version of CUDA. If you rebuild xformers yourself with CUDA 11.7, it will work fine; I have been using it for months. If you want to look for compatible xformers wheels, see https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/5865. What you did is just force-update the cuDNN version, which is not updatable via pip anyway. The only way to update it is to use a torch wheel compiled with a newer CUDA, which you have correctly suggested: cu116 instead of the current cu113.

cu117 has a tensor device placement issue for some people, while cu116 doesn't. See https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/3958#issuecomment-1309806993

aliencaocao avatar Dec 23 '22 09:12 aliencaocao

Use half precision, it boosts your performance by around 50%

For a 4090 it is exactly the same, UNTIL pytorch implements Hopper's Transformer Engine. See https://www.techpowerup.com/gpu-specs/geforce-rtx-4090.c3889 (FP16 and FP32 are both 82.58 TFLOPS)

aliencaocao avatar Dec 23 '22 09:12 aliencaocao

@aliencaocao I've updated the PR; it now upgrades fully to cu117 using xformers 0.0.16rc395. This all seems compatible, but performance is still not great.

@ninele7 mentioned maybe building with CUDA 11.8, which I might attempt when I have more time; maybe someone who knows more about all this could jump in and get it done sooner 🤷‍♂️

petalas avatar Dec 23 '22 12:12 petalas

https://github.com/ninele7/xfromers_builds/pull/1#issuecomment-1363609348

I have been using xformers on cuda 11.8 since this repo existed, with cu117 torch packages. No issues so far.

aliencaocao avatar Dec 23 '22 13:12 aliencaocao

ninele7/xfromers_builds#1 (comment)

I have been using xformers on cuda 11.8 since this repo existed, with cu117 torch packages. No issues so far.

@aliencaocao good to know, thanks; are you building xformers yourself?

I'm looking for a release that we could just install with pip, to update this PR.

I thought 0.0.16rc395 would be it, but I'm getting either No module named 'triton' (maybe that's fine) or Torch not compiled with CUDA enabled, depending on whether I install with --no-deps or not.
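A quick sanity check from inside the venv; my guess is that installing without --no-deps lets pip replace the cu117 torch with the CPU-only wheel from PyPI, hence the "Torch not compiled with CUDA enabled" error:

```
:: confirm which torch/CUDA the venv actually has
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

:: install xformers without letting pip touch torch
pip install --no-deps xformers==0.0.16rc395
```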

petalas avatar Dec 23 '22 13:12 petalas

Yes, I am building it myself, but only for 0.14.0, as I am facing some errors when building 0.15.0 and newer. Currently I'm using the official 0.16 wheels built on 11.7. The triton error exists with the official wheel too, and is normal on Windows, since Windows does not support triton. This error was not there in 0.14.0 because triton-related functions were only added in 0.15.0.

aliencaocao avatar Dec 23 '22 14:12 aliencaocao

Yes, I am building it myself, but only for 0.14.0, as I am facing some errors when building 0.15.0 and newer. Currently I'm using the official 0.16 wheels built on 11.7. The triton error exists with the official wheel too, and is normal on Windows, since Windows does not support triton. This error was not there in 0.14.0 because triton-related functions were only added in 0.15.0.

@aliencaocao so, ignoring the triton error: everything is using cu117 in this branch; any idea why it's so slow? (Now only getting ~10 it/s with a 4090.)

I was hoping that by upgrading everything to cu117 we wouldn't have to manually copy .dlls over etc.

(Trying to figure this out so we can avoid having to build xformers manually.)

petalas avatar Dec 23 '22 14:12 petalas

I don't feel it's slow. I am on a 3080 Ti.

aliencaocao avatar Dec 23 '22 14:12 aliencaocao

I don't feel it's slow. I am on a 3080 Ti.

I mean, it's all relative, but knowing 25+ it/s is possible, 11 it/s is slow imo; we're basically not getting the full potential of the card. It seems we need CUDA 11.8+ and potentially other torch improvements, so we might just have to wait, but I'm seeing what's easily upgradable.

If you don't mind testing it at some point, I'd be curious to see what performance you get on this branch vs your current setup / manual build of xformers.

petalas avatar Dec 23 '22 14:12 petalas

My current setup IS your branch lol. I do ML myself, so I already had torch+cu117 installed before this repo even existed. I use torch 1.13.1+cu117 though.

aliencaocao avatar Dec 23 '22 14:12 aliencaocao

@aliencaocao

Yes, I am building it myself, but only for 0.14.0, as I am facing some errors when building 0.15.0 and newer. Currently I'm using the official 0.16 wheels built on 11.7. The triton error exists with the official wheel too, and is normal on Windows, since Windows does not support triton. This error was not there in 0.14.0 because triton-related functions were only added in 0.15.0.

So you have no issues using xformers 0.14 and 0.16 (pre-release) on cu117? Also, how did you build xformers 0.14? I guess you followed https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/2103, but doing that builds xformers 0.15 by default. Did you use different parameters?

My current setup IS your branch lol. I do ML myself, so I already had torch+cu117 installed before this repo even existed. I use torch 1.13.1+cu117 though.

When you build your own xformers, do you use torch 1.13.1+cu117 and torchvision 0.14.0+cu117?

Panchovix avatar Dec 23 '22 14:12 Panchovix

My current setup IS your branch lol. I do ML myself, so I already had torch+cu117 installed before this repo even existed. I use torch 1.13.1+cu117 though.

Sorry, I meant a clean install of this branch, without copying things over from CUDA 11.8.

No worries; guess we're just waiting for an official 11.8 build?

petalas avatar Dec 23 '22 14:12 petalas

So you have no issues using xformers 0.14 and 0.16 (pre-release) on cu117?

Yes.

doing that builds xformers 0.15 by default.

All you have to do is git checkout an older commit if you want previous versions.

I built before 0.15.0 was even out, so no issue for me. When I built, it was with 1.13.0+cu117, since 1.13.1 was not out yet.
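The build itself was roughly this (the commit placeholder is illustrative; see the wiki for the full steps):

```
git clone https://github.com/facebookresearch/xformers.git
cd xformers
:: placeholder: check out whichever commit/tag you actually need
git checkout <older-commit>
git submodule update --init --recursive
pip install -r requirements.txt
:: build a wheel without letting pip touch the installed torch; the .whl lands in dist\
pip wheel -v --no-deps . -w dist
```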

aliencaocao avatar Dec 23 '22 14:12 aliencaocao

@aliencaocao Thanks!

May I know which commit you are using for xformers 0.14, if it isn't much of an issue? Also, I'm going to do some tests with cu117 + the built pre-release 0.16 xformers.

Really appreciate all the info.

Panchovix avatar Dec 23 '22 14:12 Panchovix

I can't remember the commit anymore, as it was the latest commit when I built. I didn't have to check out an older commit for 0.14.0 at the time.

aliencaocao avatar Dec 23 '22 14:12 aliencaocao

Yes, I am building it myself, but only for 0.14.0, as I am facing some errors when building 0.15.0 and newer. Currently I'm using the official 0.16 wheels built on 11.7. The triton error exists with the official wheel too, and is normal on Windows, since Windows does not support triton. This error was not there in 0.14.0 because triton-related functions were only added in 0.15.0.

I need to rebuild later today to be sure, but the latest git version of xformers builds for me with CUDA 11.8 and the matching torch.

The only thing tripping me up was that the build fails on newer versions of gcc, or (seemingly) any version of clang.

I will find out if everything works on CUDA 12 once Arch updates to that version...

brucethemoose avatar Dec 23 '22 17:12 brucethemoose

@C43H66N12O12S2 hey mate, first of all thanks for your xformers build and other contributions; just thought you might want to take a look at this.

I think your build is actually faster (but not compatible with the cu117 binaries).

What was it you did exactly, if you don't mind walking me through it? The only thing I can see on your fork is that you deleted the .github directory; what was the workflow that produced xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl?

Could we maybe do the same for a new version based on https://pypi.org/project/xformers/0.0.16rc396/? Or, even better, work with them to include your optimizations in the next 'official' rc?

Also, a torch wheel built with CUDA 11.8 would be great; mentioning it here just in case, maybe something you could help with?

petalas avatar Dec 24 '22 01:12 petalas

@petalas is xformers 0.0.16rc396 giving better performance for 4090 users than previous versions?

DrewWalkup avatar Dec 24 '22 02:12 DrewWalkup

@petalas is xformers 0.0.16rc396 giving better performance for 4090 users than previous versions?

@dnwalkup please see my previous comment; short answer: currently no (just slightly worse), but with additional optimizations, possibly.

(Not sure what C43H66N12O12S2 did for xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl; maybe we can do the same for 0.0.16rc396.)

petalas avatar Dec 24 '22 02:12 petalas

The only thing I can see on your fork is that you deleted the .github directory; what was the workflow that produced xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl?

Deleting the .github directory does nothing.

The build process is documented in the wiki.

aliencaocao avatar Dec 24 '22 02:12 aliencaocao

Don't want to upgrade without reason; people who already have it installed would have to download a very large package, which is a problem for many.

AUTOMATIC1111 avatar Dec 24 '22 05:12 AUTOMATIC1111

There is a reason: newer CUDA versions bring performance improvements and bug fixes, as well as support for RTX 4000-series cards, and they allow xformers to be updated too, since xformers now has official wheels which are only compiled against cu117 torch.

aliencaocao avatar Dec 24 '22 05:12 aliencaocao

Yeah ^. Again, I am using system packages atm, and the CUDA 11.8 packages + my xformers build are performing slightly better than the Python venv on my 2060 laptop.

brucethemoose avatar Dec 24 '22 05:12 brucethemoose

@petalas BTW, you should link to https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/5901 by adding closes https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/5901 in your original post

aliencaocao avatar Dec 24 '22 06:12 aliencaocao

I have successfully built the latest xformers master (https://github.com/facebookresearch/xformers/commit/e163309908ed7a76847ce46c79b238b49fd7d341) against torch 2.0.0.dev20221223+cu117 (the latest Torch 2.0 dev build as of 23 Dec) with CUDA 11.8 installed on my system. For anyone who would like to try it out and see if there are performance improvements (esp. with torch 2.0), you can download it here: https://1drv.ms/u/s!AvJPuRJUdWx_8hbpWdFpr234H5e_?e=eScCID

You will need to install this version of pytorch using:

pip3 install numpy --pre torch==2.0.0.dev20221223+cu117 torchvision torchaudio --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117

I am running an RTX 3080 Ti on Windows 10, and I am seeing notable speed improvements just by drop-in replacing torch 2.0 and the new xformers built with it: from 7.5 it/s to 8.1 it/s.

NOTE: although the version of my wheel displays as 0.0.15+e163309.d20221224, it is actually newer than all the current 0.0.16dev wheels. The 0.0.16 ones are just the xformers folks bumping the version ahead of the build temporarily, which I did not do.
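(To double-check what actually ended up installed after the swap, from the same venv:)

```
pip show torch xformers
```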

aliencaocao avatar Dec 24 '22 08:12 aliencaocao

@petalas Hey all. My build doesn't have any specific optimizations, and optimizations seemed to make little difference when I tested (a long time ago now). For example, enabling NVCC LTO and misc. optimizations made only a ~3% difference.

What you should do is use a CUDA 11.8 or 12 NVCC and the newest MSVC (which NVCC uses as the actual compiler) to build xformers.

I've not checked the official xformers builds, but the reason the 4090 doesn't benefit from them might be that they lack any SASS code for SM89 (again, no idea if that is actually the case).

I'm uncertain whether Torch 2.0 has added "support" for SM89, but 1.12 and 1.13 would error when trying to build for SM89. To work around this, you'll need to modify cpp_extension.py inside PyTorch to allow building against SM89, then set the TORCH_CUDA_ARCH_LIST env variable to your target architectures. For me, that was Pascal through Hopper (SM60 to SM90).
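Concretely, something like this before kicking off the build from an xformers checkout (the arch values are what I targeted; on torch 1.12/1.13 this still needs the cpp_extension.py patch above):

```
:: Pascal through Hopper; 8.9 is Ada (4090), 9.0 is Hopper
set TORCH_CUDA_ARCH_LIST=6.0;7.0;7.5;8.0;8.6;8.9;9.0
pip wheel -v --no-deps . -w dist
```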

Sidenote: the Transformer Engine / FP8 is unlikely to receive PyTorch support in the short term. IIRC the Meta Torch team still lacks any Hopper cards, and even then they are likely to wait for the IEEE 754 specifications.

C43H66N12O12S2 avatar Dec 24 '22 08:12 C43H66N12O12S2

Don't want to upgrade without reason; people who already have it installed would have to download a very large package, which is a problem for many.

@AUTOMATIC1111 no worries, happy to leave this open; just putting everything together for whenever you're ready to upgrade. But also, no one is really forced to upgrade; they could stick with an older commit if they chose to? Anyway, no rush :)

@C43H66N12O12S2 thanks for the info. I'm trying to avoid building manually; I might try to create a GitHub workflow for it, but it's not really my area of expertise.

petalas avatar Dec 24 '22 12:12 petalas

The only thing I can see on your fork is that you deleted the .github directory; what was the workflow that produced xformers-0.0.14.dev0-cp310-cp310-win_amd64.whl?

Deleting the .github directory does nothing.

The build process is documented in the wiki.

@aliencaocao I know, I've seen it; I was just hoping for an actual GitHub workflow file, guessing there was one in there at some point and the commit then got amended?

Like I said, the whole reason for this is to avoid having to build things manually.

petalas avatar Dec 24 '22 12:12 petalas