[Feature Request]: Enable direct-ml for this stable-diffusion-webui.
Is there an existing issue for this?
- [x] I have searched the existing issues and checked the recent builds/commits
What would your feature do ?
Enable DirectML for stable-diffusion-webui, enabling usage of Intel/AMD GPUs on Windows systems. This approach is already tested; a pull request will be submitted soon.
Proposed workflow
With pytorch-directml 1.13, we could add this feature with minimal code changes. All we need is to modify get_optimal_device_name (in devices.py) and add
if has_dml():
    return "dml"
"dml" cannot be referenced as a device name, so you should also modify get_optimal_device (also in devices.py), adding
if get_optimal_device_name() == "dml":
    import torch_directml
    return torch_directml.device()
and modify sd_models.py to avoid using "dml" as a string, changing the line from
device = map_location or shared.weight_load_location or devices.get_optimal_device_name()
to
device = map_location or shared.weight_load_location or devices.get_optimal_device()
Finally, add a DML workaround to devices.py:
# DML workaround: run cumsum on the CPU and move the result back to the original device
if has_dml():
    orig_cumsum = torch.cumsum
    orig_Tensor_cumsum = torch.Tensor.cumsum
    torch.cumsum = lambda input, *args, **kwargs: orig_cumsum(input.to("cpu"), *args, **kwargs).to(input.device)
    torch.Tensor.cumsum = lambda self, *args, **kwargs: orig_Tensor_cumsum(self.to("cpu"), *args, **kwargs).to(self.device)
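A quick way to sanity-check the workaround (just a sketch; it assumes torch-directml is installed and that the patch above has already been applied, e.g. by importing devices.py):

import torch
import torch_directml

dml = torch_directml.device()
t = torch.arange(4).to(dml)
# the patched cumsum detours through the CPU and should come back on the DML device
print(torch.cumsum(t, dim=0))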
You could define has_dml() wherever it suits your needs.
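Putting the devices.py pieces together, the result might look roughly like the sketch below; the existing CUDA/MPS branches are paraphrased from memory and may differ from the actual file:

import torch  # devices.py already imports this

def get_optimal_device_name():
    if torch.cuda.is_available():
        return "cuda"   # the real file has extra CUDA device-id handling
    if has_mps():       # existing helper in devices.py for Apple devices
        return "mps"
    if has_dml():       # new: prefer DirectML when available
        return "dml"
    return "cpu"

def get_optimal_device():
    if get_optimal_device_name() == "dml":
        # "dml" is not a valid torch.device string, so return the torch_directml device object
        import torch_directml
        return torch_directml.device()
    return torch.device(get_optimal_device_name())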
To install the environment:
conda create -n stable_diffusion_directml python=3.10
conda activate stable_diffusion_directml
conda install pytorch=1.13.1 cpuonly -c pytorch
pip install torch-directml==0.1.13.1.dev230119 gfpgan clip
pip install git+https://github.com/mlfoundations/open_clip.git@bb6e834e9c70d9c27d0dc3ecedeebeaeb1ffad6b
# Launch to clone packages including requirements
python .\launch.py --skip-torch-cuda-test --lowvram --precision full --no-half
# Install requirements
pip install -r repositories\CodeFormer\requirements.txt
pip install -r requirements.txt
# Start
python .\launch.py --skip-torch-cuda-test --lowvram --precision full --no-half
Here are examples
Additional information
No response
Interesting. Everything just works in webui after those changes? I might play with this over the weekend then.
conda
Don't tie it to conda and make people install that. Just use a native Python venv.
@vt-idiot
Everything just works in webui after those changes?
- Torch-directml is basically torch-cpuonly plus a torch_directml.device() that lets you use a DirectX GPU as the device (a short usage sketch follows this list).
- I only changed the "optimal_device" in webui to return the DML device, so most calculation is done on the DirectX GPU, but a few packages that detect the device themselves will still use the CPU.
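A minimal usage sketch of that pattern (assuming torch-directml is installed as in the steps above):

import torch
import torch_directml

dml = torch_directml.device()   # default DirectX adapter
x = torch.randn(2, 3).to(dml)   # tensors move to the DML device like to any other device
y = x @ x.T                     # the matmul runs on the GPU via DirectML
print(y.to("cpu"))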
conda
- These are just my own testing scripts; I have not modified launch.py to set up the environment automatically with pip yet (so I did not post a pull request, only an issue).
By the way, DirectML is RAM-hungry on integrated GPUs; at least 16 GB is required for SD 1.4 at 512*512. On dedicated GPUs, DirectML also costs more memory than CUDA or ROCm. This is only a solution for AMD/Intel GPUs on Windows.
you could define has_dml() wherever suits your need.
What should I write in has_dml()?
By the way, DirectML is RAM-hungry on integrated GPUs; at least 16 GB is required for SD 1.4 at 512*512. On dedicated GPUs, DirectML also costs more memory than CUDA or ROCm. This is only a solution for AMD/Intel GPUs on Windows.
Are we talking RAM or VRAM? I have more than enough RAM but "only" 8 GB of VRAM, which seems to be too little to train embeddings. If you actually only need RAM for integrated graphics cards and it's still a performance boost over CPU-only, then I wonder if this also works with a dedicated GPU.
@majorsauce Integrated GPUs use system RAM as VRAM; in this case, DirectML can roughly double system RAM usage compared with CPU training. On dedicated GPUs, DirectML also costs a little more VRAM than CUDA or ROCm.
Thank you, I have solved the problem by following your approach.
@simonlsp So my current issue is that I only have a dedicated GPU and no integrated one, but the 8 GB VRAM is not enough to train embeddings at the moment. Do you know if it's possible for the dedicated GPU to utilize RAM instead of VRAM (which obviously will be a lot slower) to do the training? Or would it come down to the same performance as simply training CPU-only?
you could define has_dml() wherever suits your need.
Can you elaborate on this point?
EDIT: Alright, for anyone else confused about this:
Add
def has_dml() -> bool:
    return True
to devices.py; note that this will force the WebUI to run in DML mode, so it's not a full fix.
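A less hard-coded variant (just a sketch, under the assumption that checking importability is enough for this setup) could test whether torch_directml can actually be imported:

import importlib.util

def has_dml() -> bool:
    # True only when the torch_directml package is actually installed
    return importlib.util.find_spec("torch_directml") is not None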
My results. TL;DR: significantly slower than ROCm, but it at least runs on Windows. For reference, I have a 6800 XT. Additionally, all testing has been done with Stable Diffusion v1.4, no LoRAs, and the Euler A sampler.
Running with ./launch.py --skip-torch-cuda-test --lowvram --precision full --no-half, I get about 2.4 s/it, using about 2 GB of VRAM.
On ROCm, I get about 8 it/s. Huge speed downgrade.
Running with ./launch.py --skip-torch-cuda-test alone, or with --lowvram --precision full (i.e. without --no-half), gives a RuntimeError: mat1 and mat2 must have the same dtype when attempting generation, so at least for now we're forced to stick with the slower --no-half mode. Iteration time is similar to the previous test. On ROCm, I still get about 8 it/s.
Running with ./launch.py --skip-torch-cuda-test --no-half is interesting: it uses about 6 GB to load the model, then rapidly consumes all available VRAM once generation begins. We reach a much higher speed - about 3 it/s at the highest - but immediately run into GPU memory allocation errors if we try to do more than one generation. On ROCm, I still get about 8 it/s.
I'm not sure why it's consuming so much VRAM. I think it may be an issue with DirectML not freeing memory, but I'm not experienced enough to tell.
New update!:
python .\launch.py --skip-torch-cuda-test --no-half --precision full --opt-sub-quad-attention appears to be the best command for now, courtesy of @lshqqytiger in issue #3756. This appears to fix the out-of-memory error by limiting how much memory can be used, although the WebUI still consumes most available VRAM. Additionally, I still get 3 it/s, even running with --no-half --precision full.
Adding onto the post above: on a 6700 XT I get about 1.5 it/s using the Windows setup, but 6 it/s using ROCm on Linux, so a significant downgrade. To do this, I cloned lshqqytiger's fork and added the def has_dml() and --no-half --precision full --opt-sub-quad-attention. --skip-torch-cuda-test seems to be outdated in the fork or something.
I've tested it on my 7900 XTX and it's faster than ONNX for me, but it's slower than MLIR (the SHARK repo uses that). Depending on the sampler I get 3-4 it/s and I've noticed that not all samplers are working. Some are throwing exceptions and some are hanging at 0%.
In my fork, I removed --skip-torch-cuda-test and added a fallback to DirectML when MPS and CUDA are unavailable.
I tested and fixed all samplers a few weeks ago, but it might have been broken by some updates. I'll check again.
(Did you clone the original repos of crowsonkb/k-diffusion and Stability-AI/stablediffusion? Then retry with lshqqytiger/k-diffusion-directml and lshqqytiger/stablediffusion-directml; I already set up git submodules for them. I fixed some samplers in those repos.)
Thank you so much for opening this issue. It works very well on my Radeon RX 570 with --lowvram.
Unrelated to this specific repository, but can we also use DirectML to run OpenAI Whisper on AMD GPUs?
DirectML can be applied to any project using PyTorch or TensorFlow. (But some features do not work if they are not implemented yet.)
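The general pattern would be the same as with the webui, roughly like the sketch below (hypothetical: whether every op Whisper needs is implemented for DirectML is a separate question, and "audio.mp3" is just a placeholder path):

import torch_directml
import whisper  # the openai-whisper package

dml = torch_directml.device()
model = whisper.load_model("base", device=dml)  # load the weights onto the DML adapter
result = model.transcribe("audio.mp3")
print(result["text"])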
My laptop has 2 GPUs: an Intel Iris Xe (iGPU) and an Intel Xe Max (dGPU). Can DirectML use both of my GPUs at the same time? System spec: i5-1135G7, 16 GB DDR4, 512 GB NVMe SSD.
You can just select which GPU will be used.
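For example (a sketch; it assumes torch_directml exposes device_count(), device_name() and device() as in recent torch-directml releases):

import torch_directml

# list the available DirectX adapters
for i in range(torch_directml.device_count()):
    print(i, torch_directml.device_name(i))

dml = torch_directml.device(1)  # pick an adapter by index, e.g. the dGPU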