
Instructions for setup and running on Mac Silicon chips

Open · crsrusl opened this issue 2 years ago · 365 comments

Hi,

I’ve heard it is possible to run Stable Diffusion on Mac Silicon (albeit slowly); it would be good to include basic setup instructions for doing this.

Thanks, Chris

crsrusl avatar Aug 16 '22 11:08 crsrusl

I heard that PyTorch's latest nightly release includes Apple MPS (Metal Performance Shaders) support as well. Will this improve performance on M1 devices by utilizing Metal?

thelamedia avatar Aug 16 '22 14:08 thelamedia

With Homebrew

brew install python@3.10
pip3 install torch torchvision
pip3 install setuptools_rust
pip3 install -U git+https://github.com/huggingface/diffusers.git
pip3 install transformers scipy ftfy

Then start python3 and follow the instructions for using diffusers.

Stable Diffusion is CPU-only on M1 Macs because not all of the PyTorch ops are implemented for Metal. Generating one image with 50 steps takes 4-5 minutes.

mja avatar Aug 20 '22 10:08 mja

Hi @mja,

thanks for these steps. I can get as far as the last one, but then installing transformers fails with this error (the install of setuptools_rust was successful):

      running build_ext
      running build_rust
      error: can't find Rust compiler

      If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.

      To update pip, run:

          pip install --upgrade pip

      and then retry package installation.

      If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

For context: the first step failed to install Python 3.10 with brew, so I did it with Conda instead. Not sure if having a full Anaconda env installed is the problem.
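For anyone hitting the same tokenizers build failure: the error text itself suggests the fix is simply to put a Rust toolchain on the PATH before retrying. A sketch (either install route should work; commands are the standard Homebrew/rustup ones, not something specific to this repo):

```shell
# Option 1: Homebrew
brew install rust

# Option 2: rustup (what the error message recommends)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source "$HOME/.cargo/env"

# Then retry the failing install
pip3 install --upgrade pip
pip3 install transformers
```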

frenchie1980 avatar Aug 20 '22 13:08 frenchie1980

Just tried the PyTorch nightly build with MPS support and have some good news.

On my CPU (M1 Max) it runs very slowly, almost 9 minutes per image, but with MPS enabled it's ~18x faster: less than 30 seconds per image 🤩

filipux avatar Aug 20 '22 20:08 filipux

Incredible! Would you mind sharing your exact setup so I can duplicate on my end?

thelamedia avatar Aug 20 '22 20:08 thelamedia

Unfortunately I got it working through many hours of trial and error, and in the end I don't know what worked. I'm not even a programmer, I'm just really good at googling stuff.

Basically my process was:

  • install PyTorch nightly
  • update macOS (12.3 is required; mine was at 12.1)
  • use a conda environment; I could not get it to work without one
  • install missing packages using either pip or conda (one of them usually works)
  • go through every file and change torch.device("cpu") / torch.device("cuda") to torch.device("mps")
  • in register_buffer() in ddim.py, change to attr = attr.to(torch.device("mps"), torch.float32)
  • in layer_norm() in functional.py (part of PyTorch, I guess), change to return torch.layer_norm(input.contiguous(), ...
  • in the terminal, run export PYTORCH_ENABLE_MPS_FALLBACK=1
  • + dozens of other things I have forgotten about...

I'm sorry that I can't be more helpful than this.
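For what it's worth, the device edits in the list above all boil down to one selection rule. A tiny sketch of that rule (the helper name is mine, not from any fork; in real code the flags would come from `torch.backends.mps.is_available()` and `torch.cuda.is_available()`):

```python
def pick_device(mps_available: bool, cuda_available: bool) -> str:
    """Return the torch device string the forks in this thread substitute in:
    prefer "mps" on Apple Silicon, then "cuda", falling back to "cpu"."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"
```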

filipux avatar Aug 20 '22 22:08 filipux

Thanks. What are you currently using for checkpoints? Are you using research weights or are you using another model for now?

thelamedia avatar Aug 21 '22 00:08 thelamedia

I don't have access to the model so I haven't tested it, but based on what @filipux said, I created this pull request to add MPS support. If you can't wait for it to be merged, you can clone my fork, switch to the apple-silicon-mps-support branch, and try it out. Just follow the normal instructions, but instead of running conda env create -f environment.yaml, run conda env create -f environment-mac.yaml. I think the only other requirement is macOS 12.3 or greater.
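Spelled out, those steps look roughly like this (a sketch: the fork URL is inferred from the username, and the conda env name `ldm` is an assumption carried over from the upstream environment.yaml):

```shell
git clone https://github.com/magnusviri/stable-diffusion.git
cd stable-diffusion
git checkout apple-silicon-mps-support
conda env create -f environment-mac.yaml
conda activate ldm   # env name assumed from the upstream environment.yaml
```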

magnusviri avatar Aug 21 '22 01:08 magnusviri

I couldn't quite get your fork to work @magnusviri, but based on most of @filipux's suggestions, I was able to install and generate samples on my M2 machine using https://github.com/einanao/stable-diffusion/tree/apple-silicon

einanao avatar Aug 21 '22 05:08 einanao

Edit: If you're looking at this comment now, you probably shouldn't follow this. Apparently a lot can change in 2 weeks!

Old comment

I got it to work fully natively without the CPU fallback, sort of. The way I did things is ugly since I prioritized making it work. I can't comment on speeds but my assumption is that using only the native MPS backend is faster?

I used the mps_master branch from kulinseth/pytorch as a base, since it contains an implementation for aten::index.Tensor_out that appears to work from what I can tell: https://github.com/Raymonf/pytorch/tree/mps_master

If you want to use my ugly changes, you'll have to compile PyTorch from scratch as I couldn't get the CPU fallback to work:

# clone the modified mps_master branch
git clone --recursive -b mps_master https://github.com/Raymonf/pytorch.git pytorch_mps && cd pytorch_mps

# dependencies to build (including for distributed)
# slightly modified from the docs
conda install astunparse numpy ninja pyyaml setuptools cmake cffi typing_extensions future six requests dataclasses pkg-config libuv

# build pytorch with explicit USE_DISTRIBUTED=1
USE_DISTRIBUTED=1 MACOSX_DEPLOYMENT_TARGET=12.4 CC=clang CXX=clang++ python setup.py install

I based my version of the Stable Diffusion code on the code from PR #47's branch, you can find my fork here: https://github.com/Raymonf/stable-diffusion/tree/apple-silicon-mps-support

Just your typical pip install -e . should work for this; there's nothing too special going on here, it's just not what I'd call upstream-quality code by any means. I have only tested txt2img, but I did try to modify knn2img and img2img too.

Edit: It definitely takes more than 20 seconds per image at the default settings with either sampler, not sure if I did something wrong. Might be hitting https://github.com/pytorch/pytorch/issues/77799 :(


@magnusviri: You are free to take anything from my branch for yourself if it's helpful at all, thanks for the PR 😃

Raymonf avatar Aug 21 '22 05:08 Raymonf

@Raymonf: I merged your changes with mine, so they are in the pull request now. It caught everything that I missed and is almost identical to the changes that @einanao made as well. The only difference I could see was in ldm/models/diffusion/plms.py

einanao:

    def register_buffer(self, name, attr):
        if type(attr) == torch.Tensor:
            if attr.device != torch.device("cuda"):
                attr = attr.type(torch.float32).to(torch.device("mps")).contiguous()

Raymonf:

    def register_buffer(self, name, attr):
        if type(attr) == torch.Tensor:
            if attr.device != torch.device(self.device_available):
                attr = attr.to(torch.float32).to(torch.device(self.device_available))

I don't know what difference the code makes, except that I read that adding .contiguous() fixes bugs when falling back to the CPU.

magnusviri avatar Aug 21 '22 05:08 magnusviri

Pretty sure my version is redundant (I also added a downstream call to .contiguous(), but forgot to remove this one)


einanao avatar Aug 21 '22 05:08 einanao

@einanao Maybe not! How long does yours take to run the default seed and prompt with full precision? GNU time reports ~~4.5~~ 2.5 minutes with the fans at 100% on a 16-inch M1 Max, which is way longer than 20 seconds. I'm curious whether your use of the CPU fallback for some parts makes it faster.

Raymonf avatar Aug 21 '22 06:08 Raymonf

It takes me 1.5 minutes to generate 1 sample on a 13 inch M2 2022

einanao avatar Aug 21 '22 15:08 einanao

I'm getting this error when trying to run with the laion400 data set:

return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled) RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

Is this an issue with the torch functional.py script?

thelamedia avatar Aug 22 '22 00:08 thelamedia

Yes, see @filipux's earlier comment:

in layer_norm() in functional.py (part of pytorch I guess), change to return torch.layer_norm(input.contiguous(), ...

einanao avatar Aug 22 '22 00:08 einanao

@einanao thank you. One step closer, but now I'm getting this: return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.mps.enabled) AttributeError: module 'torch.backends.mps' has no attribute 'enabled'

Here is my function:

    def layer_norm(
        input: Tensor,
        normalized_shape: List[int],
        weight: Optional[Tensor] = None,
        bias: Optional[Tensor] = None,
        eps: float = 1e-5,
    ) -> Tensor:
        if has_torch_function_variadic(input, weight, bias):
            return handle_torch_function(
                layer_norm, (input.contiguous(), weight, bias), input, normalized_shape, weight=weight, bias=bias, eps=eps
            )
        return torch.layer_norm(input.contiguous(), normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)

thelamedia avatar Aug 22 '22 01:08 thelamedia

It takes me 1.5 minutes to generate 1 sample on a 13 inch M2 2022

For benchmarking purposes - I'm at ~150s (2.5 minutes) on each iteration past the first, which was over 500s after setting up with the steps in these comments.

14" 2021 Macbook Pro with base specs. (M1 Pro chip)

byhringo avatar Aug 22 '22 17:08 byhringo

This worked for me. I'm seeing about 30 seconds per image on a 14" M1 Max MacBook Pro (32 GPU core).

Automatt avatar Aug 22 '22 20:08 Automatt

This worked for me. I'm seeing about 30 seconds per image on a 14" M1 Max MacBook Pro (32 GPU core).

What steps did you follow? I tried three Apple Silicon forks, but they all take about an hour to generate using the sample command (python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms). I'm using PyTorch nightly, btw.

henrique-galimberti avatar Aug 22 '22 20:08 henrique-galimberti

@henrique-galimberti I followed these steps:

  • Install PyTorch nightly
  • Used this branch referenced above from @magnusviri
  • Modify functional.py as noted above here to resolve view size not compatible issue

Automatt avatar Aug 22 '22 21:08 Automatt

mps support for aten::index.Tensor_out is now in pytorch nightly according to Denis

recurrence avatar Aug 22 '22 21:08 recurrence

Looks like there's a ticket for the reshape error at https://github.com/pytorch/pytorch/issues/80800

recurrence avatar Aug 22 '22 21:08 recurrence

mps support for aten::index.Tensor_out is now in pytorch nightly according to Denis

Is that the pytorch nightly branch? That particular branch is 1068 commits ahead and 28606 commits behind the master. The last commit was 15 hours ago. But master has commits kind of non-stop for the last 9 hours.

magnusviri avatar Aug 22 '22 22:08 magnusviri

@henrique-galimberti I followed these steps:

  • Install PyTorch nightly
  • Used this branch referenced above from @magnusviri
  • Modify functional.py as noted above here to resolve view size not compatible issue

Where can I find the functional.py file?

pnodseth avatar Aug 22 '22 22:08 pnodseth

Where can I find the functional.py file?

import torch
torch.__file__

For me the path is below. Your path will be different.

'/Users/lab/.local/share/virtualenvs/lab-2cY4ojCF/lib/python3.10/site-packages/torch/__init__.py'

Then, in that path, replace __init__.py with nn/functional.py to locate the file.
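That path substitution can be sketched as a one-liner (the helper function is hypothetical, just to show the arithmetic on the path printed by torch.__file__):

```python
import os

def functional_py_path(torch_init_path: str) -> str:
    """Given torch.__file__ (which points at torch/__init__.py),
    return the sibling nn/functional.py in the same install."""
    return os.path.join(os.path.dirname(torch_init_path), "nn", "functional.py")
```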

cgodley avatar Aug 22 '22 22:08 cgodley

I changed the conda env to use Rosetta and it is faster than before, but still waaaay too slow. (screenshot of timings omitted)

henrique-galimberti avatar Aug 22 '22 22:08 henrique-galimberti

Is that the pytorch nightly branch? That particular branch is 1068 commits ahead and 28606 commits behind the master.

It was merged 5 days ago so it should be in the regular PyTorch nightly that you can get directly from the PyTorch site.

recurrence avatar Aug 22 '22 22:08 recurrence

@henrique-galimberti I followed these steps:

* Install PyTorch nightly

* Used [this branch](https://github.com/CompVis/stable-diffusion/pull/47) referenced above from @magnusviri

* Modify functional.py as noted above [here](https://github.com/CompVis/stable-diffusion/issues/25#issuecomment-1221667017) to resolve view size not compatible issue

I also followed these steps and confirmed MPS was being used (printed the return value of get_device()) but it's taking about 31.74s/it, which seems very slow.

  • macOS 12.5
  • MacBook Pro M1 14" base model (16GB of memory, 14 GPU cores)

cgodley avatar Aug 22 '22 23:08 cgodley

After waiting a few minutes the average speed increased to 12.74s/it

I closed all other apps; the memory pressure graph looks fine. (screenshot omitted)

cgodley avatar Aug 22 '22 23:08 cgodley