InvokeAI
[enhancement]: Integrate Apple's CoreML optimizations for SD
Is there an existing issue for this?
- [X] I have searched the existing issues
Contact Details
No response
What should this feature add?
Apple has announced some tooling for SD optimization on M* Macs. If we can get it integrated quickly (or first...), it could be a huge boon to Invoke's reach.
https://machinelearning.apple.com/research/stable-diffusion-coreml-apple-silicon
https://github.com/apple/ml-stable-diffusion
Still reading, but I wanted to raise the issue with this community.
Alternatives
No response
Additional Content
No response
Just looked into this a little; it looks really promising, but there are some issues as of now:
- Models have to be converted in order to work. Not a big deal, but worth calling out.
- Converted models do not seem to accept any width/height other than 512x512 (or whatever the model was originally trained on; see https://github.com/apple/ml-stable-diffusion/blob/main/python_coreml_stable_diffusion/pipeline.py#L225). That looks like a blocker until they fix it.
- The Python generation pipeline currently has to load the model from scratch every time (2-3 minutes!) and is unable to cache it. Their FAQ describes this in more detail. There's a Swift pipeline that can avoid it, but I'm not sure that helps here.
I'd love to see this in action here. Hope my digging helps a bit, and hope their repo advances quickly to fix these shortcomings.
Please add this!
> Converted models do not seem to accept any width/height other than 512x512 (or whatever the model was originally trained on; see https://github.com/apple/ml-stable-diffusion/blob/main/python_coreml_stable_diffusion/pipeline.py#L225). That looks like a blocker until they fix it.
I have no experience with CoreML, but at least from the documentation, this doesn't seem like a major issue; it can even be modified after the model has been converted. The process seems very similar to tracing a model with PyTorch JIT.
The last point will probably be more tedious, since it kinda nullifies the speed gains. Will try to understand a bit better to see how hard it is to integrate, just out of curiosity.
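For what it's worth, here's a minimal sketch of how flexible input shapes can be declared when converting a traced model with coremltools; the toy Conv2d model and the latent shape values are hypothetical stand-ins, not Apple's actual UNet conversion code:

```python
import torch
import coremltools as ct

# Hypothetical stand-in for the traced UNet: any traced torch.nn.Module works the same way.
model = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1).eval()
example = torch.rand(1, 4, 64, 64)  # 64x64 latents correspond to 512x512 pixels
traced = torch.jit.trace(model, example)

# Declare a set of allowed latent shapes instead of a single fixed one.
latent_shapes = ct.EnumeratedShapes(
    shapes=[[1, 4, 64, 64], [1, 4, 64, 96], [1, 4, 96, 64]],
    default=[1, 4, 64, 64],
)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="sample", shape=latent_shapes)],
    convert_to="mlprogram",
)
mlmodel.save("unet_flexible.mlpackage")
```

Whether the converted SD UNet then actually runs well on the GPU/ANE at non-default shapes is a separate question, but the conversion API itself doesn't force a single resolution.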
I've been doing a lot of testing with Apple's repo. Here are the modes that worked best on my M1 Max with a 24-core GPU and 32GB of RAM:
Using `--attention-implementation ORIGINAL` with `CPU_AND_GPU`, I got up to 2.96 it/s using ~32W and ~8GB of RAM. This combination should be best for maximum performance on Pro/Max/Ultra.
Using `--attention-implementation SPLIT_EINSUM` with `CPU_AND_NE`, I got 1.44 it/s using <5W and <1GB of RAM. This combination makes a great high-efficiency mode, and should even get similar performance on an iPhone/iPad with an A14 or better.
Using `--attention-implementation SPLIT_EINSUM` with `ALL` (CPU, GPU and Neural Engine), I got 2.46 it/s using ~13W and ~3GB of RAM, without maxing out my GPU. This combination makes a good balanced mode, and will likely offer the highest performance on M1/M2 chips, and possibly the M1 Pro as well, since it has 2/3 the GPU cores of my M1 Max.
I'll also note that the versions using the Neural Engine took about 4 minutes to initially load the model, though the Swift implementation was able to do this only on the first run and load more quickly after that. Asitop seemed to indicate that the Neural Engine wasn't running anywhere near maximum while generating 512x512 images, but the code in Apple's repo doesn't let you modify that. That, combined with the almost nonexistent memory footprint, makes me think this might work really well for generating larger images with hires_fix, or using large tiles with embiggen.
Ideally, Invoke would support these three modes (power, efficiency, and balanced) and let the user choose among them, depending on which processor they have and how much power they want to use (having my laptop act as a lap warmer in the current weather is kinda nice, but if I were on battery power, I'd definitely want to use the Neural Engine instead of my GPU).
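To make that concrete, here's a rough sketch of what a mode selector wrapping Apple's Python CLI could look like. The mode names, the `run_coreml_sd` helper, and the model directory paths are hypothetical; only the CLI flags come from the commands discussed in this thread. Since the attention implementation is chosen at conversion time, each mode points at a matching converted model directory.

```python
import subprocess

# Hypothetical mapping of user-facing modes to Apple's CLI options.
# The attention implementation is baked in at conversion time, so each mode
# must point at a model directory converted with the matching flag (placeholder paths).
MODES = {
    "power":      {"compute_unit": "CPU_AND_GPU", "model_dir": "models/sd-v1-4_original_packages"},
    "balanced":   {"compute_unit": "ALL",         "model_dir": "models/sd-v1-4_split_einsum_packages"},
    "efficiency": {"compute_unit": "CPU_AND_NE",  "model_dir": "models/sd-v1-4_split_einsum_packages"},
}

def run_coreml_sd(prompt: str, mode: str, out_dir: str = "output/", seed: int = 93) -> None:
    """Invoke Apple's python_coreml_stable_diffusion CLI with the flags for the chosen mode."""
    cfg = MODES[mode]
    cmd = [
        "python", "-m", "python_coreml_stable_diffusion.pipeline",
        "--prompt", prompt,
        "-i", cfg["model_dir"],
        "-o", out_dir,
        "--compute-unit", cfg["compute_unit"],
        "--seed", str(seed),
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_coreml_sd("a photo of an astronaut riding a horse on mars", mode="balanced")
```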
@whosawhatsis
> Using `--attention-implementation SPLIT_EINSUM` with `ALL` (CPU, GPU and Neural Engine), I got 2.46 it/s using ~13W and ~3GB of RAM, without maxing out my GPU. This combination makes a good balanced mode, and will likely offer the highest performance on M1/M2 chips, and possibly the M1 Pro as well, since it has 2/3 the GPU cores of my M1 Max.
I tried this but I am getting 8GB+ RAM usage. I used this command; am I doing something wrong?
```
python -m python_coreml_stable_diffusion.pipeline --prompt "a photo of an astronaut riding a horse on mars" -i models/coreml-stable-diffusion-v1-4_split_einsum_packages -o output/ --compute-unit ALL --seed 93
```
I have this in the models folder: `coreml-stable-diffusion-v1-4_split_einsum_packages`
> Just looked into this a little; it looks really promising, but there are some issues as of now:
> - Models have to be converted in order to work. Not a big deal, but worth calling out.
> - Converted models do not seem to accept any width/height other than 512x512 (or whatever the model was originally trained on; see https://github.com/apple/ml-stable-diffusion/blob/main/python_coreml_stable_diffusion/pipeline.py#L225). That looks like a blocker until they fix it.
> - The Python generation pipeline currently has to load the model from scratch every time (2-3 minutes!) and is unable to cache it. Their FAQ describes this in more detail. There's a Swift pipeline that can avoid it, but I'm not sure that helps here.
> I'd love to see this in action here. Hope my digging helps a bit, and hope their repo advances quickly to fix these shortcomings.
Thanks for the link. That led me to the Hugging Face pipeline. They show a good example there, defining the call to allow for flexible height and width: https://github.com/huggingface/diffusers/blob/4125756e88e82370c197fecf28e9f0b4d7eee6c3/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L412
```python
height: Optional[int] = None,
width: Optional[int] = None,
```
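For comparison, this is how those call-time arguments are used with the regular (non-CoreML) diffusers pipeline; the model ID and the image size below are just illustrative:

```python
from diffusers import StableDiffusionPipeline

# Standard PyTorch pipeline (illustrative model ID), running on the Apple GPU via MPS.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("mps")

# height and width are plain call-time arguments rather than being baked into the model,
# which is exactly the flexibility the CoreML-converted UNet currently lacks.
image = pipe(
    "a photo of an astronaut riding a horse on mars",
    height=512,
    width=768,
).images[0]
image.save("astronaut.png")
```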
> Just looked into this a little; it looks really promising, but there are some issues as of now:
> - Models have to be converted in order to work. Not a big deal, but worth calling out.
> - Converted models do not seem to accept any width/height other than 512x512 (or whatever the model was originally trained on; see https://github.com/apple/ml-stable-diffusion/blob/main/python_coreml_stable_diffusion/pipeline.py#L225). That looks like a blocker until they fix it.
> - The Python generation pipeline currently has to load the model from scratch every time (2-3 minutes!) and is unable to cache it. Their FAQ describes this in more detail. There's a Swift pipeline that can avoid it, but I'm not sure that helps here.
> I'd love to see this in action here. Hope my digging helps a bit, and hope their repo advances quickly to fix these shortcomings.
Both the 2nd and 3rd points only apply to the SPLIT_EINSUM version of the converted models (which targets CPU+ANE). For the ORIGINAL version (which targets CPU+GPU), it's possible to change the width/height, and model loading is fast.
Apart from the base M1/M2, where the ANE outperforms the GPU, the ORIGINAL version works better on Pro/Max/Ultra. More benchmarks can be found on Apple's project page and on PromptToImage's project page.
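For anyone poking at this from Python, the compute-unit choice is just a load-time option in coremltools (the .mlpackage path below is a placeholder):

```python
import coremltools as ct

# The compute unit is chosen when the compiled model is loaded, independent of conversion.
unet = ct.models.MLModel(
    "path/to/Stable_Diffusion_unet.mlpackage",  # placeholder path to a converted UNet
    compute_units=ct.ComputeUnit.CPU_AND_GPU,   # alternatives: ct.ComputeUnit.CPU_AND_NE, ct.ComputeUnit.ALL
)
```

The attention implementation, by contrast, is fixed at conversion time, so switching between ORIGINAL and SPLIT_EINSUM means keeping two sets of converted packages around.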
Btw, Apple recently updated its project to support ControlNet as well, and MochiDiffusion has already added support for it. Yay for competition? Anyway, the future of Stable Diffusion on Apple Silicon looks really promising. Can't wait for SDXL!