stable-diffusion
Instructions for setup and running on Mac Silicon chips
Hi,
I’ve heard it is possible to run Stable Diffusion on Mac Silicon (albeit slowly); it would be good to include basic setup and running instructions for this.
Thanks, Chris
I heard that PyTorch was updated to include Apple MPS (Metal Performance Shaders) support in the latest nightly release as well. Will this improve performance on M1 devices by utilizing Metal?
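(For reference, the MPS-enabled nightly can be installed with something like the command below — this was the macOS nightly line from pytorch.org around the time of this thread, so double-check the current one there:)
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu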
With Homebrew
brew install python@3.10
pip3 install torch torchvision
pip3 install setuptools_rust
pip3 install -U git+https://github.com/huggingface/diffusers.git
pip3 install transformers scipy ftfy
Then start python3
and follow the instructions for using diffusers.
Stable Diffusion is CPU-only on M1 Macs because not all the PyTorch ops are implemented for Metal. Generating one image with 50 steps takes 4-5 minutes.
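(A minimal sketch of the diffusers usage referred to above — the model ID and the need for a Hugging Face login are assumptions on my part; check the diffusers docs for the current API:)

from diffusers import StableDiffusionPipeline

# Assumes you have accepted the model license and run `huggingface-cli login`.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe = pipe.to("cpu")  # CPU-only for now, per the note above

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")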
Hi @mja,
thanks for these steps. I can get as far as the last one, but then installing transformers fails with this error (the install of setuptools_rust was successful):
running build_ext
running build_rust
error: can't find Rust compiler
If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.
To update pip, run:
pip install --upgrade pip
and then retry package installation.
If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects
For context, the first step failed to install Python 3.10 with brew, so I did it with Conda instead. Not sure if having a full Anaconda env installed is the problem.
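(As the error message suggests, either upgrade pip so a prebuilt tokenizers wheel can be used, or install a Rust toolchain before retrying. The rustup one-liner from rustup.rs is:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
then re-run pip3 install transformers scipy ftfy.)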
Just tried the PyTorch nightly build with MPS support and have some good news.
On my CPU (M1 Max) it runs very slowly, almost 9 minutes per image, but with MPS enabled it's ~18x faster: less than 30 seconds per image 🤩
Incredible! Would you mind sharing your exact setup so I can duplicate on my end?
Unfortunately I got it working after many hours of trial and error, and in the end I don't know what worked. I'm not even a programmer, I'm just really good at googling stuff.
Basically my process was:
- install pytorch nightly
- update osx (12.3 required, mine was at 12.1)
- use a conda environment, I could not get it to work without it
- install missing packages using either pip or conda (one of them usually works)
- go through every file and change torch.device("cpu"/"cuda") to torch.device("mps") (sketched below)
- in register_buffer() in ddim.py, change to
attr = attr.to(torch.device("mps"), torch.float32)
- in layer_norm() in functional.py (part of pytorch I guess), change to
return torch.layer_norm(input.contiguous(), ...
- in terminal write
export PYTORCH_ENABLE_MPS_FALLBACK=1
- + dozens of other things I have forgotten about...
I'm sorry that I can't be more helpful than this.
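(For anyone following along, the cpu/cuda → mps swap in that list amounts to something like this — a minimal sketch, assuming a PyTorch nightly that exposes the MPS backend:

import torch

# Use MPS when the nightly build exposes it, otherwise fall back to CPU.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

x = torch.randn(2, 3, device=device)  # tensors and modules then target this device
print(x.device)

The hard-coded torch.device("cpu") / torch.device("cuda") calls throughout the repo get replaced with this device.)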
Thanks. What are you currently using for checkpoints? Are you using research weights or are you using another model for now?
I don't have access to the model so I haven't tested it, but based on what @filipux said, I created this pull request to add MPS support. If you can't wait for them to merge it, you can clone my fork, switch to the apple-silicon-mps-support branch, and try it out. Just follow the normal instructions, but instead of running conda env create -f environment.yaml, run conda env create -f environment-mac.yaml. I think the only other requirement is that you have macOS 12.3 or greater.
I couldn't quite get your fork to work @magnusviri, but based on most of @filipux's suggestions, I was able to install and generate samples on my M2 machine using https://github.com/einanao/stable-diffusion/tree/apple-silicon
Edit: If you're looking at this comment now, you probably shouldn't follow this. Apparently a lot can change in 2 weeks!
Old comment
I got it to work fully natively without the CPU fallback, sort of. The way I did things is ugly since I prioritized making it work. I can't comment on speeds but my assumption is that using only the native MPS backend is faster?
I used the mps_master branch from kulinseth/pytorch as a base, since it contains an implementation for aten::index.Tensor_out
that appears to work from what I can tell: https://github.com/Raymonf/pytorch/tree/mps_master
If you want to use my ugly changes, you'll have to compile PyTorch from scratch as I couldn't get the CPU fallback to work:
# clone the modified mps_master branch
git clone --recursive -b mps_master https://github.com/Raymonf/pytorch.git pytorch_mps && cd pytorch_mps
# dependencies to build (including for distributed)
# slightly modified from the docs
conda install astunparse numpy ninja pyyaml setuptools cmake cffi typing_extensions future six requests dataclasses pkg-config libuv
# build pytorch with explicit USE_DISTRIBUTED=1
USE_DISTRIBUTED=1 MACOSX_DEPLOYMENT_TARGET=12.4 CC=clang CXX=clang++ python setup.py install
I based my version of the Stable Diffusion code on the code from PR #47's branch, you can find my fork here: https://github.com/Raymonf/stable-diffusion/tree/apple-silicon-mps-support
Just your typical pip install -e .
should work for this, there's nothing too special going on here, it's just not what I'd call upstream-quality code by any means. I have only tested txt2img, but I did try to modify knn2img and img2img too.
Edit: It definitely takes more than 20 seconds per image at the default settings with either sampler, not sure if I did something wrong. Might be hitting https://github.com/pytorch/pytorch/issues/77799 :(
@magnusviri: You are free to take anything from my branch for yourself if it's helpful at all, thanks for the PR 😃
@Raymonf: I merged your changes with mine, so they are in the pull request now. It caught everything that I missed and it is almost identical to the changes that @einanao made as well. The only difference I could see was in ldm/models/diffusion/plms.py
einanao:
def register_buffer(self, name, attr):
    if type(attr) == torch.Tensor:
        if attr.device != torch.device("cuda"):
            attr = attr.type(torch.float32).to(torch.device("mps")).contiguous()
Raymonf:
def register_buffer(self, name, attr):
    if type(attr) == torch.Tensor:
        if attr.device != torch.device(self.device_available):
            attr = attr.to(torch.float32).to(torch.device(self.device_available))
I don't know what the code differences are, except that I read that adding .contiguous() fixes bugs when falling back to the cpu.
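(For what it's worth, .contiguous() copies a tensor view into a standard, densely packed memory layout, which some backends and fallback paths expect. A quick illustration:

import torch

t = torch.randn(4, 4).t()   # transposing returns a non-contiguous view
print(t.is_contiguous())    # False
c = t.contiguous()          # copies the data into a contiguous row-major layout
print(c.is_contiguous())    # True
)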
Pretty sure my version is redundant (I also added a downstream call to .contiguous(), but forgot to remove this one)
@einanao Maybe not! How long does yours take to run the default seed and prompt with full precision? GNU time reports ~~4.5~~2.5 minutes with the fans at 100% on a 16-inch M1 Max, which is way longer than 20 seconds. I'm curious whether using the CPU fallback for some parts makes it faster at all.
It takes me 1.5 minutes to generate 1 sample on a 13 inch M2 2022
I'm getting this error when trying to run with the laion400 data set:
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled) RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
Is this an issue with the torch functional.py script?
Yes, see @filipux's earlier comment:
in layer_norm() in functional.py (part of pytorch I guess), change to return torch.layer_norm(input.contiguous(), ...
@einanao thank you. One step closer, but now I'm getting this:
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.mps.enabled) AttributeError: module 'torch.backends.mps' has no attribute 'enabled'
Here is my function:
def layer_norm(
    input: Tensor,
    normalized_shape: List[int],
    weight: Optional[Tensor] = None,
    bias: Optional[Tensor] = None,
    eps: float = 1e-5,
) -> Tensor:
    if has_torch_function_variadic(input, weight, bias):
        return handle_torch_function(
            layer_norm, (input.contiguous(), weight, bias), input, normalized_shape, weight=weight, bias=bias, eps=eps
        )
    return torch.layer_norm(input.contiguous(), normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
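(For reference, @filipux's earlier suggestion only changes input to input.contiguous(); the last argument stays torch.backends.cudnn.enabled — torch.backends.mps has no enabled attribute, which is what the AttributeError above is complaining about. The minimal edit is just:

return torch.layer_norm(input.contiguous(), normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
)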
It takes me 1.5 minutes to generate 1 sample on a 13 inch M2 2022
For benchmarking purposes: I'm at ~150s (2.5 minutes) on each iteration past the first (which was over 500s), after setting up with the steps in these comments.
14" 2021 MacBook Pro with base specs (M1 Pro chip).
This worked for me. I'm seeing about 30 seconds per image on a 14" M1 Max MacBook Pro (32 GPU core).
This worked for me. I'm seeing about 30 seconds per image on a 14" M1 Max MacBook Pro (32 GPU core).
What steps did you follow? I tried three Apple forks but they are all taking ~1h to generate using the sample command (python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms). I'm using PyTorch nightly, btw.
@henrique-galimberti I followed these steps:
- Install PyTorch nightly
- Used this branch referenced above from @magnusviri
- Modify functional.py as noted above here to resolve view size not compatible issue
mps support for aten::index.Tensor_out is now in pytorch nightly according to Denis
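(A quick way to sanity-check this in your own nightly — assuming, on my part, that plain tensor indexing exercises that op:

import torch

x = torch.arange(10, device="mps")
idx = torch.tensor([1, 3, 5], device="mps")
print(x[idx])  # on older builds this needed the CPU fallback or raised NotImplementedError
)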
Looks like there's a ticket for the reshape error at https://github.com/pytorch/pytorch/issues/80800
mps support for aten::index.Tensor_out is now in pytorch nightly according to Denis
Is that the pytorch nightly branch? That particular branch is 1068 commits ahead and 28606 commits behind the master. The last commit was 15 hours ago. But master has commits kind of non-stop for the last 9 hours.
@henrique-galimberti I followed these steps:
- Install PyTorch nightly
- Used this branch referenced above from @magnusviri
- Modify functional.py as noted above here to resolve view size not compatible issue
Where can I find the functional.py file ?
Where can I find the functional.py file ?
import torch
torch.__file__
For me the path is below. Your path will be different.
'/Users/lab/.local/share/virtualenvs/lab-2cY4ojCF/lib/python3.10/site-packages/torch/__init__.py'
Then replace __init__.py with nn/functional.py in that path to find the file.
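(Equivalently, you can print the path of functional.py directly:

import torch.nn.functional as F
print(F.__file__)  # full path to torch/nn/functional.py
)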
I changed the conda env to use Rosetta and it is faster than before, but still waaaay too slow.
Is that the pytorch nightly branch? That particular branch is 1068 commits ahead and 28606 commits behind the master.
It was merged 5 days ago so it should be in the regular PyTorch nightly that you can get directly from the PyTorch site.
@henrique-galimberti I followed these steps:
- Install PyTorch nightly
- Used [this branch](https://github.com/CompVis/stable-diffusion/pull/47) referenced above from @magnusviri
- Modify functional.py as noted above [here](https://github.com/CompVis/stable-diffusion/issues/25#issuecomment-1221667017) to resolve view size not compatible issue
I also followed these steps and confirmed MPS was being used (I printed the return value of get_device()), but it's taking about 31.74s/it, which seems very slow.
- macOS 12.5
- MacBook Pro M1 14" base model (16GB of memory, 14 GPU cores)
After waiting a few minutes, the average improved to 12.74s/it.
I closed all other apps; mem pressure graph looks like this: [screenshot omitted]