
[Feature Request]: Support for Apple's Core ML Stable Diffusion

Open jkcarney opened this issue 2 years ago • 20 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What would your feature do?

https://github.com/apple/ml-stable-diffusion

Apple recently added support for converting Stable Diffusion models to the Core ML format, which can reduce generation time.

  1. It would be nice to support this conversion pipeline within the web UI, perhaps as an option in the Extras tab or the Checkpoint Merger tab (it's not really a merge per se, but it could fit there).
  2. Allow the webUI to run the Core ML models instead of the regular SD PyTorch models.

Proposed workflow

  1. Go to the Extras tab or the Checkpoint Merger tab.
  2. Select a script (or similar) to convert a .ckpt file in your models directory to the Core ML format (a rough sketch of what that conversion could call is shown below this list).
  3. Allow use of the Core ML model within the webUI for Apple users.
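
For what it's worth, here is a minimal sketch of what step 2 could call under the hood, using the torch2coreml converter from Apple's repo linked above. The flag names are taken from its README at the time of writing and may change; the helper function and paths are placeholders, not webui code:

```python
# Hypothetical helper for a webui script: shell out to Apple's converter to
# turn a Hugging Face Stable Diffusion model into Core ML .mlpackage files.
import subprocess
from pathlib import Path

def convert_to_coreml(model_version: str, output_dir: Path) -> None:
    """Convert a Stable Diffusion model to Core ML (rough sketch, not webui code)."""
    output_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "python", "-m", "python_coreml_stable_diffusion.torch2coreml",
            "--convert-unet",
            "--convert-text-encoder",
            "--convert-vae-decoder",
            "--model-version", model_version,  # Hugging Face model ID
            "-o", str(output_dir),
        ],
        check=True,
    )

convert_to_coreml("CompVis/stable-diffusion-v1-4", Path("./models/coreml"))
```

Note that, as far as I can tell, the converter works from a Hugging Face model ID (diffusers format) rather than a local .ckpt, so a .ckpt from the models directory would first need to be converted to the diffusers layout.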

Additional information

No response

jkcarney avatar Dec 01 '22 20:12 jkcarney

https://github.com/apple/ml-stable-diffusion/issues/9

NightMachinery avatar Dec 02 '22 05:12 NightMachinery

This please!

ioma8 avatar Dec 02 '22 07:12 ioma8

https://github.com/apple/ml-stable-diffusion

I've tried running and inspecting the sample from the repository above (still investigating). It looks like the Core ML format does not reduce image generation time; if anything, the PyTorch implementation is slightly faster (at least on my Mac Studio, M1 Ultra, 48-core GPU). Wanting Automatic1111 to support the Core ML format is fine, but before that, everyone with an M1 Mac should do some benchmarking and carefully consider whether it is really an urgent request.

Translated from Japanese to English by Google.
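
For anyone who wants to reproduce this kind of comparison, here is a rough sketch of timing the PyTorch/MPS side with diffusers (assuming diffusers and a recent PyTorch are installed; the Core ML side can be timed with the python_coreml_stable_diffusion.pipeline CLI shown further down this thread). This is not how the webui measures it/s, just a quick comparable number:

```python
# Rough MPS benchmark with diffusers, excluding model load time.
import time
import torch
from diffusers import StableDiffusionPipeline

assert torch.backends.mps.is_available(), "MPS backend not available"

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("mps")
prompt = "a photo of an astronaut riding a horse on mars"
steps = 20

pipe(prompt, num_inference_steps=1)  # warm-up; the first MPS pass is always slow

start = time.time()
pipe(prompt, num_inference_steps=steps)
elapsed = time.time() - start
print(f"{steps / elapsed:.2f} it/s ({elapsed:.1f}s total, excluding model load)")
```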

autumnmotor avatar Dec 02 '22 14:12 autumnmotor

Yes, I have also finally tried it on a Mac M1 and it is indeed slower than the current implementation.

ioma8 avatar Dec 02 '22 16:12 ioma8

Did you try the Python one or the Swift one?

sascha1337 avatar Dec 02 '22 17:12 sascha1337

Yes, I have also finally tried it on a Mac M1 and it is indeed slower than the current implementation.

Please publish all the relevant details, e.g., macOS version (latest 13.1 beta is needed), which Mac, which compute units, Swift or Python, and whether you have included the model loading time.

Their own benchmarks say that an M2 generates an image in 23 seconds, which is certainly much faster than PyTorch. I myself don't have macOS 13.1 and Xcode installed to test.

NightMachinery avatar Dec 02 '22 19:12 NightMachinery

I can test on a MacBook Air M2 with 24 GB of RAM, but a little guidance on how (no crazy detail needed) would be nice.

ronytomen avatar Dec 02 '22 21:12 ronytomen

Mac Studio (M1 Ultra, 128 GB RAM, 48-core GPU), macOS Ventura (13.1 beta), Xcode 14.1. Image size: 512x512, steps: 20 (21 in coreml_python), model: CompVis/stable-diffusion-v1-4. Times do not include model load time.

WebUI (MPS): 2.61 it/s
Core ML (Python; cpu, gpu, ane): 2.64 it/s
Core ML (Swift; cpu, gpu, ane): 2.23 it/s

There doesn't seem to be a dramatic difference in speed.

Automatic1111 SD WebUI (sampling method: Euler a), using MPS:

Total progress: 100%|███████████████████████████| 20/20 [00:07<00:00, 2.61it/s]

Using CPU only (--use-cpu all):

Total progress: 100%|███████████████████████████| 20/20 [01:10<00:00, 3.51s/it]

= 0.285 it/s

Core ML (Python; scheduler: default, probably DDIM), compute unit ALL:

python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i ./sdmodel -o ./out --compute-unit ALL --seed 93 --num-inference-steps 20

100%|███████████████████████████████████████████| 21/21 [00:07<00:00, 2.64it/s]

python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i ./sdmodel -o ./out --compute-unit CPU_AND_GPU --seed 93 --num-inference-steps 20

100%|███████████████████████████████████████████| 21/21 [00:12<00:00, 1.73it/s]

python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i ./sdmodel -o ./out --compute-unit CPU_AND_NE --seed 93 --num-inference-steps 20

100%|███████████████████████████████████████████| 21/21 [00:17<00:00, 1.22it/s]

Core ML (Swift; scheduler/sampling method unknown, since I don't know much about Swift), compute units all:

swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" --resource-path ./sdmodel/Resources/ --seed 93 --output-path ./out --step-count 20 --compute-units all

Step 20 of 20 [mean: 2.23, median: 2.50, last 2.46] step/sec

swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" --resource-path ./sdmodel/Resources/ --seed 93 --output-path ./out --step-count 20 --compute-units cpuAndNeuralEngine

Step 20 of 20 [mean: 1.16, median: 1.17, last 1.16] step/sec

swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" --resource-path ./sdmodel/Resources/ --seed 93 --output-path ./out --step-count 20 --compute-units cpuAndGPU

Step 20 of 20 [mean: 1.86, median: 2.96, last 2.95] step/sec

swift run StableDiffusionSample "a photo of an astronaut riding a horse on mars" --resource-path ./sdmodel/Resources/ --seed 93 --output-path ./out --step-count 20 --compute-units cpuOnly

Step 20 of 20 [mean: 0.12, median: 0.12, last 0.12] step/sec

autumnmotor avatar Dec 02 '22 23:12 autumnmotor

@autumnmotor ~~Aren't your results much better than Torch MPS? 1.22it/s vs 2.61it/s.~~

use MPS

Total progress: 100%|███████████████████████████| 20/20 [00:07<00:00, 2.61it/s]
python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i ./sdmodel -o ./out --compute-unit CPU_AND_NE --seed 93 --num-inference-steps 20

100%|███████████████████████████████████████████| 21/21 [00:17<00:00, 1.22it/s]

NightMachinery avatar Dec 02 '22 23:12 NightMachinery

INFO:python_coreml_stable_diffusion.coreml_model:Loading a CoreML model through coremltools triggers compilation every time. The Swift package we provide uses precompiled Core ML models (.mlmodelc) to avoid compile-on-load.

...and rumors tell us we should use Ventura 13.1 beta 4.

What exact macOS build are you using, sir?

sascha1337 avatar Dec 03 '22 00:12 sascha1337

@NightMachinery Hmmm... I think the higher the it/s value (iterations per second), the better the performance. Also check the total processing time on the left (not including model load): webui (MPS): 7 sec, coreml_python: 17 sec.

@sascha1337

...and rumors tell us we should use Ventura 13.1 beta 4.

My macOS is 13.1 beta (22C5033e), which means beta 1. Thanks for the very useful information. Luckily I'm in the Apple Developer Program, so I'll try beta 4 later.

autumnmotor avatar Dec 03 '22 00:12 autumnmotor

macOS 13.1 beta 1 -> beta 4:

WebUI (MPS): 2.61 it/s -> 2.66 it/s
Core ML (Python; cpu, gpu, ane): 2.64 it/s -> 2.65 it/s
Core ML (Swift; cpu, gpu, ane): 2.23 it/s -> 2.23 it/s

I still need to look into it more carefully, but for now I'd say the differences are within the margin of error.

autumnmotor avatar Dec 03 '22 01:12 autumnmotor

@autumnmotor Sir, what sampler did you use, DDIM?

sascha1337 avatar Dec 04 '22 00:12 sascha1337

@autumnmotor Could you also report Core ML (Python; cpu, gpu), please?

atiorh avatar Dec 17 '22 19:12 atiorh

There is a basic implementation now:

https://github.com/godly-devotion/mochi-diffusion

Would be great if you guys somehow teamed up!

juan9999 avatar Dec 19 '22 16:12 juan9999

How would we go about testing the CoreML versions already converted? I assume I can't just drop all of the files into the models directory?

rjp23 avatar Feb 03 '23 14:02 rjp23

use MPS

How do we use MPS with the webUI? I thought it only ran on the CPU.
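
As far as I know, the webui picks up PyTorch's MPS backend automatically on Apple silicon when it is available; a quick way to check from Python (a generic PyTorch check, not webui-specific code):

```python
import torch

# True on Apple silicon with a recent PyTorch build; if this is False the
# webui will end up on the CPU (or you are forcing it with --use-cpu).
print(torch.backends.mps.is_available())  # can MPS be used right now?
print(torch.backends.mps.is_built())      # was PyTorch compiled with MPS support?
```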

RnbWd avatar Feb 09 '23 04:02 RnbWd

Please do it

namnhfreelancer avatar Mar 25 '23 17:03 namnhfreelancer

I'm a Mac user and I tried the Draw Things app, which supports Core ML. On a Mac mini M1, with the same Anything v3 model at 30 steps, it takes about 2 min 40 s to generate an image with the webui versus 45 s with Draw Things, so I think supporting Core ML is still worthwhile. On another Mac, an M1 Pro, it's 20 s for Draw Things versus 35 s for the webui. (As an aside, Draw Things' interface design is also worth referencing.)

GrinZero avatar Apr 29 '23 11:04 GrinZero

please.

9Somboon avatar Jul 02 '23 06:07 9Somboon

bump

I just compiled the HF Diffusers app on my M2 Max and can whip out a 45-step SD 2.1 image in about 18.5 s, versus 43 s with A1111 and the pruned model.

genevera avatar Aug 04 '23 18:08 genevera

There are apps that use Apple's Core ML Stable Diffusion. The best one I could find is here: https://github.com/godly-devotion/MochiDiffusion

However, if you've ever tried using Apple's Core ML implementation, you might have noticed that it takes a LONG time to initialize the model the first time it runs. Using the CLI from Apple's examples, it takes about a minute on my M1 before the model even starts running. I think Mochi caches the compiled Core ML model, which makes it more usable. On a MacBook Air M1, I'm only seeing a 20% increase in diffusion speed at most, and the startup time for loading any model makes it not worth it. An M1 Pro/Max or M2 Pro/Max might see much more significant gains than the base M1.
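
For context on that startup cost: loading a converted model through coremltools recompiles it on every load (that is the warning quoted earlier in this thread), whereas the Swift package ships precompiled .mlmodelc bundles. Here is a minimal sketch of loading one of the converted packages and picking the compute units with coremltools; the file path is just a placeholder for whatever the converter produced:

```python
import coremltools as ct

# Loading an .mlpackage via coremltools triggers a compile step each time,
# which accounts for much of the slow startup people are reporting.
unet = ct.models.MLModel(
    "models/coreml/unet.mlpackage",  # placeholder path to the converted UNet
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # or ALL / CPU_AND_GPU / CPU_ONLY
)
print(unet.input_description)  # quick sanity check that the model loaded
```

Presumably Mochi Diffusion keeps the compiled model around between runs, which is why it avoids paying that cost on every launch.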

RnbWd avatar Aug 08 '23 02:08 RnbWd

Dude, this depends on the RAM. With 8 GB you get bad times; 128 GB is the way to go, so you can keep the UNet chunks in cache.

sascha1337 avatar Aug 08 '23 03:08 sascha1337

Please do it!

foolyoghurt avatar Sep 03 '23 13:09 foolyoghurt

Draw Things has an HTTP API. Maybe requests for things it can handle could be sent over to it while keeping A1111 as the front end?
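
If someone wanted to prototype that, something like the following could forward a generation request from Python; note that the port, endpoint path, and payload keys here are hypothetical placeholders, since I haven't checked what Draw Things' API actually expects:

```python
import base64
import requests

# Hypothetical bridge: forward a txt2img job to a locally running Draw Things
# HTTP API and save the returned image. Endpoint, port, and payload fields are
# placeholders -- adjust to whatever the real API expects.
payload = {
    "prompt": "a photo of an astronaut riding a horse on mars",
    "steps": 20,
    "width": 512,
    "height": 512,
}
resp = requests.post("http://127.0.0.1:7860/txt2img", json=payload, timeout=600)
resp.raise_for_status()
image_b64 = resp.json()["images"][0]  # assumed response shape
with open("out.png", "wb") as f:
    f.write(base64.b64decode(image_b64))
```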

genevera avatar Oct 27 '23 01:10 genevera

There are apps that use Apple's Core ML Stable Diffusion. The best one I could find is here: https://github.com/godly-devotion/MochiDiffusion

However, if you've ever tried using Apple's Core ML implementation, you might have noticed that it takes a LONG time to initialize the model the first time it runs. Using the CLI from Apple's examples, it takes about a minute on my M1 before the model even starts running. I think Mochi caches the compiled Core ML model, which makes it more usable. On a MacBook Air M1, I'm only seeing a 20% increase in diffusion speed at most, and the startup time for loading any model makes it not worth it. An M1 Pro/Max or M2 Pro/Max might see much more significant gains than the base M1.

Check out Draw Things... it's not open source but it is free and it beats everything else in performance, I think.

genevera avatar Oct 27 '23 01:10 genevera

Dude, this depends on the RAM. With 8 GB you get bad times; 128 GB is the way to go, so you can keep the UNet chunks in cache.

128 GB of Mac memory, lol. Apple's golden money earner. Apple silicon currently only goes up to 96 GB, I believe.

marshalleq avatar Dec 30 '23 00:12 marshalleq