deps: upgrade to PyTorch 2.0 (replaces xformers)
PyTorch 2.0 provides a direct interface to several optimized implementations of scaled dot-product attention, so we no longer need to depend explicitly on xformers or triton.
Fixes #2405
I did some quick and dirty tests here on Linux/CUDA (RTX 3060) and it seems to work in this environment.
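For reference, the interface in question is `torch.nn.functional.scaled_dot_product_attention`; here is a minimal sketch of calling it directly (illustrative only, not the PR's actual attention code):

```python
import torch
import torch.nn.functional as F

# Toy q/k/v shaped (batch, heads, tokens, head_dim). On CUDA with fp16, PyTorch
# can dispatch to the flash or memory-efficient kernels; otherwise it falls back
# to the C++ math implementation.
q = torch.randn(1, 8, 64, 40)
k = torch.randn(1, 8, 64, 40)
v = torch.randn(1, 8, 64, 40)

out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=False)
print(out.shape)  # torch.Size([1, 8, 64, 40])
```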
To Do
- [x] test on Windows
- [ ] test on MPS
- [x] test on ROCm
- [ ] figure out what to do with our `_adjust_memory_efficient_attention` method. Is it entirely obsolete now that pytorch has a C++ cross-platform implementation of scaled dot-product attention to fall back on, or will we still need that?
- [ ] provide some interface to `torch.backends.cudnn.deterministic` as per the notes on avoiding nondeterministic algorithms.
So this triples my speeds on a 4090.
- Update to pytorch 2.0.0 and remove xformers: ~11it/s --> ~22it/s
- Do not call `_adjust_memory_efficient_attention`: ~22it/s --> ~33it/s
Unfortunately, even with a 24GB VRAM card, I get OOM errors when decoding large images - like around 1600 x 1600 and up. At some point in the past, I think before diffusers, I could do over 2048 x 2048.
However! With #2920, I can now generate absolutely gargantuan images. There are some artifacts due to the tiling, but a 3072 x 3072 uses under 8GB VRAM.
Happy camper here. Thanks!
Can confirm what @psychedelicious reported. I get nearly 3x speeds on an RTX 3080 laptop GPU on Windows too. But I have to note that this speed boost keeps diminishing as I generate more and more images.
Further testing: I installed the xformers 0.0.17 dev build and left the `_adjust_memory_efficient_attention` call in place along with Torch 2. The generation speeds are even better. I went up from 2it/s to 6it/s for a 512x768 image.
Putting some thoughts and testing results here in this PR.
With some brief testing, you get that performance boost but also non-deterministic behavior. None of the options available (subject to change, according to the docs) allow us to reproduce images made with pre-2.0 pytorch, however it looks like we may be able to get determinism with:
```python
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_math_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(False)
```
I think this makes it not quite as fast as what you have above. I'd love other people to try that out in place of `_adjust_memory_efficient_attention`. I'm also curious what diffusers will do w/rt pytorch 2.0.
We also need to look into whether we can disable memory efficient attention for non-CUDA or if we have to leave that code alone.
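If we go down that road, PyTorch 2.0 also offers a context-manager form of the same switches, which would let us scope the backend choice instead of setting it globally. A rough sketch (assuming `torch.backends.cuda.sdp_kernel` stays available, since the docs mark this area as subject to change):

```python
import torch
import torch.nn.functional as F

# Assumes a CUDA device; the backend switches only affect the CUDA dispatch.
q = k = v = torch.randn(1, 8, 64, 40, device="cuda", dtype=torch.float16)

# Same three switches as above, but only in effect inside the block.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
```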
Do we need to micromanage each possible implementation like that, or is it sufficient to use `torch.backends.cudnn.deterministic = True`? https://pytorch.org/docs/master/notes/randomness.html#cuda-convolution-determinism
> Do we need to micromanage each possible implementation like that, or is it sufficient to use `torch.backends.cudnn.deterministic = True`? https://pytorch.org/docs/master/notes/randomness.html#cuda-convolution-determinism
Maybe we can get away with that, or `torch.use_deterministic_algorithms(True)`. I don't know what that does from a performance perspective, but I can toy around with it. And if it is deterministic after setting either of those, then we have to see what effect that has on memory and attention slicing. I'll investigate when I can.
Here's what I get:
```
RuntimeError: Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or `at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
```
After setting determinism on with `torch.use_deterministic_algorithms(True)` and doing as the error message suggests, images generated in succession are identical to each other, and image generation times still look to be maybe a bit faster; the far better times come at the expense of reproducibility.
It looks like we can potentially yank out the attention slicing code but that causes images to be different between torch 2.0 and pre-torch 2.0.
On the plus side, even with those deterministic algorithms in use, I can now generate really large images until I get to the decoding step.
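For anyone else who wants to try it, here is roughly the setup I used, pieced together from the error message above (a sketch; it assumes we can arrange for the env var to be set before torch is first imported):

```python
import os

# Must be set before cuBLAS is initialized; setting it before the first
# torch import is the safe option.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch

torch.use_deterministic_algorithms(True)
# Belt-and-braces for the cuDNN convolution paths mentioned in the randomness notes.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```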
Looks like the current anchor for that section of the cuBLAS docs is a bit different: https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility
Having to set the environment variable is a bit awkward. Can you tell if it's something that needs to be set before the library is initialized, or can we (re)set it on the fly at runtime? Wouldn't be so bad if we could do it that way.
Maybe somewhere in `CLI.py` (or another place/places), we do `os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"`? As long as that hits before torch and friends load up, I imagine that will work.
If we cannot change these settings on the fly, can we explore wrapping the initialization of torch so we can re-do that if these settings are changed, without having to fully quit the application?
> Can confirm what @psychedelicious reported. I get nearly 3x speeds on an RTX 3080 laptop GPU on Windows too. But I have to note that this speed boost keeps diminishing as I generate more and more images.
Why would the speed increase decline? Is there a memory leak?
Tested on a ROCm system:
- good news: Renders a nearly identical "banana sushi" to 1.13. Differences are subtle and about the same as generation-to-generation variances with xformers on a CUDA system. No variation from one image to the next when using 2.0.0 repeatedly.
- disappointing news: No improvement in rendering speed.
- expected news: On the AMD GPU that I use, there is a ~60s "warmup period" before rendering starts the very first time torch is called. After that, there is no delay, even when invokeai is killed and restarted. This is the same behavior I observed previously in 1.13, and it was fixed by recompiling pytorch from source.
I tested in a CUDA system (NVIDIA RTX A2000 12GB) just now and the performance of 1.13+xformers is equal to 2.0.0 without xformers. No 3x speedup in my hands, unfortunately!
@lstein did you comment out the call to `_adjust_memory_efficient_attention`? Doing that was half of the 200% improvement.
> Maybe somewhere in `CLI.py` (or another place/places), we do `os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"`? As long as that hits before torch and friends load up, I imagine that will work.
It is a bit tricky to use environment variables to configure python libraries, since the environment variable needs to be set before the first `import torch` statement. In the current code, we are already doing this at the top of `CLI.py` (edited a bit here for clarity):
```python
import os
import re
import shlex
import sys
import traceback
from argparse import Namespace
from pathlib import Path
from typing import Union

# [more non-torch imports]

if sys.platform == "darwin":
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

from ...backend import Generate, ModelManager

# [more backend imports]
```
If we need to set a bunch of environment variables, then I would suggest that we make a new .py file with all the environment-setting code in it. Alternatively, we change the code execution so that the command-line arguments are parsed early on; this would enable us to make the environment variable settings a feature of the `invokeai.init` file.
Something similar has to be done for `api_app.py` and `cli_app.py` for nodes.
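A rough sketch of what such a shared environment module could look like (the module name and the set of defaults are placeholders, not a settled design):

```python
# invokeai/environment.py (hypothetical) -- import this before anything that imports torch.
import os
import sys

_DEFAULTS = {
    # Needed for reproducible cuBLAS results when deterministic algorithms are on.
    "CUBLAS_WORKSPACE_CONFIG": ":4096:8",
}

if sys.platform == "darwin":
    _DEFAULTS["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

for _name, _value in _DEFAULTS.items():
    # setdefault so a user-supplied value in the shell still wins
    os.environ.setdefault(_name, _value)
```

`CLI.py`, `api_app.py`, and `cli_app.py` would then only need this one import at the very top, before any of the backend imports.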
> @lstein did you comment out the call to `_adjust_memory_efficient_attention`? Doing that was half of the 200% improvement.
It's already commented out in the PR.
> > @lstein did you comment out the call to `_adjust_memory_efficient_attention`? Doing that was half of the 200% improvement.
>
> It's already commented out in the PR.
I also tried running with 2.0.0 and xformers 0.0.17rc482, with and without `_adjust_memory_efficient_attention`, and am not seeing any effect on rendering speeds. This is all with 512x512 images and stable-diffusion-1.5. Are the improvements more dramatic with larger images?
> I also tried running with 2.0.0 and xformers 0.0.17rc482, with and without `_adjust_memory_efficient_attention`, and am not seeing any effect on rendering speeds. This is all with 512x512 images and stable-diffusion-1.5. Are the improvements more dramatic with larger images?
For me, the improvements were basically the same across all resolutions (about 3x faster). I did not notice any degradation in performance over time, but that probably needs more careful testing - just going from memory here.
I think you've tried all of the permutations then - maybe the improvements are related to the new cu118? Could be it has improvements for only certain platforms (the A2000 and the 30xx cards are Ampere, while the 40xx series is Ada Lovelace).
I'll run through some tests and see if the degradation persists. But in either case, I think upgrading to PyTorch 2 is a no-brainer once we have all the roadblocks resolved.
There's been no activity on the PR for several days. Seems to me we should just go ahead with this?
> There's been no activity on the PR for several days. Seems to me we should just go ahead with this?
I thought you disliked non-deterministic behavior? I think we also need to resolve whether we should keep the slicing code, as per the comments above, for the case where the user wants determinism - which we can do, but we need to address it. IMO this is not ready for prime time, and we should lock things in at `torch~=1.13.1` until we figure it out.
Is anyone still working on this? Otherwise it's going to get left behind.
I've been using 2.0.0 since it released. But I am also using xformers together with it because I get much faster results. There are obviously determinism issues, but those exist with xformers too.
So maybe we keep the attention as it is for now, upgrade to 2.0, and also pin xformers to the latest version so it is compatible with the new torch.
Can we add a new flag to disable the slicing? I'd rather get the massive speed boost than have deterministic results most of the time.
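For what it's worth, the actual wiring for such a flag would be small, since diffusers pipelines already expose `enable_attention_slicing()` / `disable_attention_slicing()`. A sketch, with the flag name purely hypothetical:

```python
def configure_attention_slicing(pipe, no_attention_slicing: bool) -> None:
    """Apply a hypothetical --no-attention-slicing flag to a diffusers pipeline."""
    if no_attention_slicing:
        # Full, unsliced attention: fastest with torch 2.0 SDPA, but outputs
        # differ from sliced attention (see the determinism discussion above).
        pipe.disable_attention_slicing()
    else:
        pipe.enable_attention_slicing("auto")
```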
@psychedelicious @lstein We need to do more than that just to get deterministic behavior - and we should have the option to do so. See above. All of this makes me uncomfortable that we'll lose reproducibility; isn't that important for the audit trail we want to have for results?
@lstein - I believe we've addressed this concern w/ latest xformers update. Good to close?