[Feature Request]: Enable direct-ml for this stable-diffusion-webui.
Is there an existing issue for this?
- [x] I have searched the existing issues and checked the recent builds/commits
What would your feature do ?
Enable DirectML for stable-diffusion-webui, enabling usage of Intel/AMD GPUs on Windows systems. This approach is already tested; a pull request will be submitted soon.
Proposed workflow
With pytorch-directml 1.13, we could add this feature with minimal code changes. All we need is to modify get_optimal_device_name (in devices.py) and add
if has_dml():
    return "dml"
"dml" cannot be referenced as a device name, so you should also modify get_optimal_device (also in devices.py), adding
if get_optimal_device_name() == "dml":
    import torch_directml
    return torch_directml.device()
and modify sd_models.py to avoid using "dml" as a string, changing the line from
device = map_location or shared.weight_load_location or devices.get_optimal_device_name()
to
device = map_location or shared.weight_load_location or devices.get_optimal_device()
Finally, add a DML workaround to devices.py:
# DML workaround: run cumsum on the CPU and move the result back to the original device
if has_dml():
    orig_cumsum = torch.cumsum
    orig_Tensor_cumsum = torch.Tensor.cumsum
    torch.cumsum = lambda input, *args, **kwargs: orig_cumsum(input.to("cpu"), *args, **kwargs).to(input.device)
    torch.Tensor.cumsum = lambda self, *args, **kwargs: orig_Tensor_cumsum(self.to("cpu"), *args, **kwargs).to(self.device)
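A quick way to sanity-check the workaround (just a sketch; it assumes torch-directml is installed and that the patch above has already been applied, e.g. by importing devices.py):

import torch
import torch_directml

dml = torch_directml.device()
t = torch.arange(4).to(dml)
# the patched cumsum detours through the CPU and should come back on the DML device
print(torch.cumsum(t, dim=0))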
You could define has_dml() wherever it suits your needs.
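Putting the devices.py pieces together, the result might look roughly like the sketch below; the existing CUDA/MPS branches are paraphrased from memory and may differ from the actual file:

import torch  # devices.py already imports this

def get_optimal_device_name():
    if torch.cuda.is_available():
        return "cuda"   # the real file has extra CUDA device-id handling
    if has_mps():       # existing helper in devices.py for Apple devices
        return "mps"
    if has_dml():       # new: prefer DirectML when available
        return "dml"
    return "cpu"

def get_optimal_device():
    if get_optimal_device_name() == "dml":
        # "dml" is not a valid torch.device string, so return the torch_directml device object
        import torch_directml
        return torch_directml.device()
    return torch.device(get_optimal_device_name())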
To install the environment:
conda create -n stable_diffusion_directml python=3.10
conda activate stable_diffusion_directml
conda install pytorch=1.13.1 cpuonly -c pytorch
pip install torch-directml==0.1.13.1.dev230119 gfpgan clip
pip install git+https://github.com/mlfoundations/open_clip.git@bb6e834e9c70d9c27d0dc3ecedeebeaeb1ffad6b
# Launch to clone packages including requirements
python .\launch.py --skip-torch-cuda-test --lowvram --precision full --no-half
# Install requirements
pip install -r repositories\CodeFormer\requirements.txt
pip install -r requirements.txt
# Start
python .\launch.py --skip-torch-cuda-test --lowvram --precision full --no-half
Here are examples
Additional information
No response
Interesting. Everything just works in webui after those changes? I might play with this over the weekend then.
conda
Don't tie it to conda and make people install that. Just use a native Python venv.
@vt-idiot
Everything just works in webui after those changes?
- Torch-directml is basically torch-cpuonly plus a torch_directml.device() that lets you use a DirectX GPU as the device (a short usage sketch follows this list).
- I only changed the "optimal_device" in webui to return the DML device, so most calculation is done on the DirectX GPU, but a few packages that detect the device themselves will still use the CPU.
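A minimal usage sketch of that pattern (assuming torch-directml is installed as in the steps above):

import torch
import torch_directml

dml = torch_directml.device()   # default DirectX adapter
x = torch.randn(2, 3).to(dml)   # tensors move to the DML device like to any other device
y = x @ x.T                     # the matmul runs on the GPU via DirectML
print(y.to("cpu"))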
conda
- These are just my own testing scripts; I have not modified launch.py to set up the environment automatically with pip yet (so I did not post a pull request, only an issue).
By the way, DirectML is RAM-hungry on integrated GPUs; at least 16 GB is required for SD 1.4 at 512*512. On dedicated GPUs, DirectML also costs more memory than CUDA or ROCm. This is only a solution for AMD/Intel GPUs on Windows.
you could define has_dml() wherever suits your need.
What should I write in has_dml()?
By the way, DirectML is RAM-hungry on integrated GPUs; at least 16 GB is required for SD 1.4 at 512*512. On dedicated GPUs, DirectML also costs more memory than CUDA or ROCm. This is only a solution for AMD/Intel GPUs on Windows.
Are we talking RAM or VRAM? I have more than enough RAM but "only" 8 GB of VRAM, which seems to be too little to train embeddings. If you actually only need RAM for integrated graphics cards and it's still a performance boost over CPU-only, then I wonder if this also works with a dedicated GPU.
@majorsauce Integrated GPUs use system RAM as VRAM; in this case, DirectML can roughly double system RAM usage compared with CPU training. On dedicated GPUs, DirectML also costs a little more VRAM than CUDA or ROCm.
Thank you, I have solved the problem by following your approach.
@simonlsp So my current issue is that I only have a dedicated GPU and no integrated one, but the 8 GB VRAM is not enough to train embeddings at the moment. Do you know if it's possible for the dedicated GPU to utilize RAM instead of VRAM (which obviously will be a lot slower) to do the training? Or would it come down to the same performance as simply training CPU-only?
you could define has_dml() wherever suits your need.
Can you elaborate on this point?
EDIT: Alright, for anyone else confused about this:
Add
def has_dml() -> bool:
    return True
to devices.py; note that this will force the WebUI to run in DML mode, so it's not a full fix.
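A less hard-coded variant (just a sketch, under the assumption that checking importability is enough for this setup) could test whether torch_directml can actually be imported:

import importlib.util

def has_dml() -> bool:
    # True only when the torch_directml package is actually installed
    return importlib.util.find_spec("torch_directml") is not None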
My results. TL;DR: significantly slower than ROCm, but it at least runs on Windows. For reference, I have a 6800 XT. Additionally, all testing has been done with Stable Diffusion v1.4, no LoRAs, and the Euler A sampler.
Running with ./launch.py --skip-torch-cuda-test --lowvram --precision full --no-half, I get about 2.4 s/it, using about 2 GB of VRAM.
On ROCm, I get about 8 it/s. Huge speed downgrade.
Running with ./launch.py --skip-torch-cuda-test alone, or with --lowvram --precision full (i.e. without --no-half), gives a RuntimeError: mat1 and mat2 must have the same dtype when attempting generation, so at least for now we're forced to stick with the slower --no-half mode. Iteration time is similar to the previous test. On ROCm, I still get about 8 it/s.
Running with ./launch.py --skip-torch-cuda-test --no-half is interesting: it uses about 6 GB to load the model, then rapidly consumes all available VRAM once generation begins. We reach a much higher speed - about 3 it/s at the highest - but immediately run into GPU memory allocation errors if we try to do more than one generation. On ROCm, I still get about 8 it/s.
I'm not sure why it's consuming so much VRAM. I think it may be an issue with DirectML not freeing memory, but I'm not experienced enough to tell.
New update!:
python .\launch.py --skip-torch-cuda-test --no-half --precision full --opt-sub-quad-attention appears to be the best command for now, courtesy of @lshqqytiger in issue #3756. This appears to fix the out-of-memory error by limiting how much memory can be used, although the WebUI still consumes most available VRAM. Additionally, I still get 3 it/s, even running with --no-half --precision full.
Adding onto the post above: on a 6700 XT I get about 1.5 it/s using the Windows setup, but 6 it/s using ROCm on Linux, so a significant downgrade. To do this, I cloned lshqqytiger's fork and added the def has_dml() and --no-half --precision full --opt-sub-quad-attention. --skip-torch-cuda-test seems to be outdated in the fork or something.
I've tested it on my 7900 XTX and it's faster than ONNX for me, but it's slower than MLIR (the SHARK repo uses that). Depending on the sampler I get 3-4 it/s and I've noticed that not all samplers are working. Some are throwing exceptions and some are hanging at 0%.
In my fork, I removed --skip-torch-cuda-test and added a fallback to DirectML when MPS and CUDA are unavailable.
I tested and fixed all samplers a few weeks ago, but it might have been broken by some updates. I'll check again.
(Did you clone the original repos of crowsonkb/k-diffusion and Stability-AI/stablediffusion? Then retry with lshqqytiger/k-diffusion-directml and lshqqytiger/stablediffusion-directml; I already set up git submodules for them. I fixed some samplers in those repos.)
Thank you so much for opening this issue. It works very well on my Radeon RX 570 with --lowvram.
Unrelated to this specific repository, but can we also use DirectML to run OpenAI Whisper on AMD GPUs?
DirectML can be applied to any project using PyTorch or TensorFlow. (But some features do not work if they are not implemented yet.)
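The general pattern would be the same as with the webui, roughly like the sketch below (hypothetical: whether every op Whisper needs is implemented for DirectML is a separate question, and "audio.mp3" is just a placeholder path):

import torch_directml
import whisper  # the openai-whisper package

dml = torch_directml.device()
model = whisper.load_model("base", device=dml)  # load the weights onto the DML adapter
result = model.transcribe("audio.mp3")
print(result["text"])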
My laptop has 2 GPUs: an Intel Iris Xe (iGPU) and an Intel Xe Max (dGPU). Can DirectML use both of my GPUs at the same time? System spec: i5-1135G7, 16 GB DDR4, 512 GB NVMe SSD.
You can just select which GPU will be used.
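For example (a sketch; it assumes torch_directml exposes device_count(), device_name() and device() as in recent torch-directml releases):

import torch_directml

# list the available DirectX adapters
for i in range(torch_directml.device_count()):
    print(i, torch_directml.device_name(i))

dml = torch_directml.device(1)  # pick an adapter by index, e.g. the dGPU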