[Open-to-community] Benchmark swift-coreml-diffusers on different Mac hardware

Open Vaibhavs10 opened this issue 1 year ago • 40 comments

Hey hey,

We are on a mission to provide a first-class, one-click solution to blazingly fast diffusers inference on Mac. To get a better idea of how our framework performs, we'd like to collect inference time benchmarks for the app.

Currently, we are explicitly looking for benchmarks on:

  • [ ] M1 Pro - @tcapelle, @emwdx, @Pindar777
  • [ ] M1 Pro (6/14/16) - @abazlinton
  • [ ] M2 Pro - @tcapelle, @mja, @SerCeMan
  • [ ] M2 Max - @Tz-H, @lovelace

You can do so by following the steps below:

  1. Download the latest version of the Diffusers app from the App Store.
  2. Select one of the three options in the Advanced settings.
  3. Enter a prompt, e.g. "A Labrador playing in the fields".
  4. Run inference and note the time taken.

Note: Make sure to run inference multiple times, as the framework sometimes needs to prepare the weights before it can run them in the most efficient way possible.
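A minimal sketch of how the collected timings could be summarized, assuming the first run is treated as a warm-up and dropped. The function name `summarize_runs` and the `warmup=1` default are illustrative assumptions, not something the app provides, and the sample timings are made up.

```python
from statistics import mean, stdev


def summarize_runs(times_s, warmup=1):
    """Summarize inference times (in seconds), skipping `warmup` initial runs.

    `warmup=1` is an assumption: the first run may include the one-off
    weight preparation, so it is excluded from the statistics.
    """
    steady = times_s[warmup:]
    if not steady:
        raise ValueError("need at least one run after the warm-up runs")
    return {
        "runs": len(steady),
        "mean_s": round(mean(steady), 2),
        "min_s": min(steady),
        # Sample standard deviation needs at least two data points.
        "spread_s": round(stdev(steady), 2) if len(steady) > 1 else 0.0,
    }


# Hypothetical timings: one slow first run followed by three steady runs.
print(summarize_runs([21.0, 15.4, 15.2, 15.2]))
```

Reporting the steady-state numbers separately from the first run keeps results comparable across machines.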

Ping @pcuenca and @vaibhavs10 for any queries or questions!

Happy diffusing 🧨

Vaibhavs10 avatar Feb 24 '23 09:02 Vaibhavs10

I can do the M2Pro Mac Mini!

tcapelle avatar Feb 24 '23 15:02 tcapelle

I can do the M2Pro Mac Mini!

Cool, assigned it to you above :)

pcuenca avatar Feb 24 '23 15:02 pcuenca

With default settings with 25 steps:

Macbook Pro 14" with M1 Pro GPU 16 Cores - 16GB of ram - 8 perf cores

  • ANE: 15.4s, 15.2, 15.2
  • GPU: 13.7s, 13.9s, 13.7s (Using less than 4GB of ram 🤯)
  • GPU + ANE: 15.4, 15.2, 15.4

Mac Mini with M2 Pro GPU 16 Cores - 16GB of ram - 6 perf cores

  • ANE: For some reason, on this machine the ANE was the default: 10.4, 10.3, 10.4 (no ram usage reported?!)
  • GPU: 12.4s, 12.3s, 12.3s
  • ANE+GPU: 10.9, 10.8, 10.8

tcapelle avatar Feb 24 '23 15:02 tcapelle

Thanks a lot @tcapelle that's super helpful!

For some reason, on this machine the ANE was the default

Yeah, we have a simple rule (based on the number of performance cores, which is a good proxy for the rest of the hardware). It looks like it worked on both of your computers, didn't it? (The best option was selected by default.)

A couple of questions, if you can.

  • How many performance cores does each computer have? (I find sysctl hw.perflevel0.physicalcpu to be easy).
  • What model did you test? Relative performance is usually consistent across models.

The ANE+GPU performance is very close in both computers! I'm expecting ANE+GPU to beat just ANE in some of the MBP M2 Pro combinations.
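For anyone scripting this, here is a rough Python sketch that reads the performance core count via the `sysctl` key mentioned above. The `guess_compute_units` function and its threshold of 8 are purely hypothetical, chosen only because they match the two machines reported so far. They are not the app's actual rule.

```python
import subprocess


def parse_sysctl_value(output):
    """Parse output such as 'hw.perflevel0.physicalcpu: 8' into an int."""
    _, _, value = output.strip().rpartition(":")
    return int(value)


def performance_core_count():
    """Query the performance core count (macOS on Apple Silicon only)."""
    out = subprocess.run(
        ["sysctl", "hw.perflevel0.physicalcpu"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sysctl_value(out)


def guess_compute_units(perf_cores, threshold=8):
    """HYPOTHETICAL rule for illustration, not the app's real heuristic.

    The threshold is an assumption that happens to fit the two machines
    above: 8 performance cores -> GPU, 6 performance cores -> ANE.
    """
    return "GPU" if perf_cores >= threshold else "ANE"


# Parsing a sample line, so the sketch also runs on non-Mac systems.
cores = parse_sysctl_value("hw.perflevel0.physicalcpu: 8")
print(cores, guess_compute_units(cores))
```

On a Mac, `performance_core_count()` can replace the hard-coded sample line.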

pcuenca avatar Feb 24 '23 16:02 pcuenca

I used default settings, so it's sd-base-2.0

tcapelle avatar Feb 24 '23 16:02 tcapelle

Do you know a trick to query how many GPU cores the machine has? It would be really cool to retrieve this info programmatically so we can log it to the wandb info.

tcapelle avatar Feb 24 '23 16:02 tcapelle

I suppose terminal is ok:

ioreg -l | grep gpu-core-count | tail -1 | awk -F"=\ " '{print $NF}'

(Only produces results on Apple Silicon)
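If the value is needed from Python (for example, to log it alongside the timings), the same idea can be sketched like this. It takes the last `gpu-core-count` entry, mirroring the `tail -1` above. The sample line format is an assumption about what `ioreg -l` prints, and the function names are made up for illustration.

```python
import re
import subprocess


def parse_gpu_core_count(ioreg_output):
    """Return the last "gpu-core-count" value in `ioreg -l` output, or None."""
    matches = re.findall(r'"gpu-core-count"\s*=\s*(\d+)', ioreg_output)
    return int(matches[-1]) if matches else None


def gpu_core_count():
    """Run `ioreg -l` and extract the GPU core count (Apple Silicon only)."""
    out = subprocess.run(
        ["ioreg", "-l"],
        capture_output=True, text=True, errors="replace", check=True,
    ).stdout
    return parse_gpu_core_count(out)


# Assumed sample line, so the parser can be tried without a Mac.
sample = '    | |   "gpu-core-count" = 16\n'
print(parse_gpu_core_count(sample))
```

`parse_gpu_core_count` returns `None` when the key is absent, which is what to expect on Intel Macs.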

pcuenca avatar Feb 24 '23 16:02 pcuenca

there is also this thingy: https://github.com/tlkh/asitop

tcapelle avatar Feb 24 '23 16:02 tcapelle

there is also this thingy: https://github.com/tlkh/asitop

Oh interesting. This is what they do: https://github.com/tlkh/asitop/blob/main/asitop/utils.py#L123

pcuenca avatar Feb 24 '23 16:02 pcuenca

Same config as tcapelle above for comparison.

Default settings with 25 steps, Macbook Pro 14" with M1 Pro GPU 16 Cores - 16GB of ram - 8 perf cores

  • ANE: 15.2, 15.1, 15.3
  • GPU: 13.9, 13.7, 13.7
  • GPU + ANE: 14.2, 14.5, 14.4

Similar results as above, so that's cool.

emwdx avatar Feb 25 '23 13:02 emwdx

Thanks a lot @emwdx! I think the app should have selected the best option (GPU) for you, is that correct?

Interestingly, your GPU+ANE combination is better than @tcapelle's (although still not better than just GPU).

pcuenca avatar Feb 25 '23 13:02 pcuenca

It did automatically select the GPU, yes :)

emwdx avatar Feb 25 '23 13:02 emwdx

Amazing work huggingface team ❤️!

Here are mine -

14" MacBook M1 Pro - 14 GPU cores / 6 performance cores - All settings default (SD 2-base)


  • ANE: 15.2, 15.2, 15.2
  • GPU: 15.1, 15.1, 15.2
  • ANE+GPU: 14.4, 14.5, 14.4

abazlinton avatar Feb 25 '23 15:02 abazlinton

14" MacBook M2 Max - 64 GB - 30 cores

hw.perflevel0.physicalcpu: 8

Settings

  • Models: stabilityai/stable-diffusion-2-1-base
  • Prompts: A Labrador playing in the fields
  • Steps: 25
  • Seed: -1

Result

  • GPU: 7.7, 7.7, 7.6
  • ANE: 10.3, 10.3, 10.3
  • GPU + ANE: 10.6, 10.6, 10.7

Tz-H avatar Feb 26 '23 00:02 Tz-H

Which model should we run for this benchmark?

julien-c avatar Feb 26 '23 08:02 julien-c

@julien-c Ideally, the 4 we used in the benchmark: https://huggingface.co/blog/fast-mac-diffusers#performance-benchmarks

But results seem consistent across models, so most people are doing just stabilityai/stable-diffusion-2-base or stabilityai/stable-diffusion-2-1-base.

pcuenca avatar Feb 26 '23 08:02 pcuenca

Amazing work huggingface team ❤️!

Here are mine -

14" MacBook M1 Pro - 14 GPU cores / 6 performance cores - All settings default (SD 2-base)

  • ANE: 15.2, 15.2, 15.2
  • GPU: 15.1, 15.1, 15.2
  • ANE+GPU: 14.4, 14.5, 14.4

Very interesting test @abazlinton! This is the first time we've seen GPU+ANE beat both GPU and ANE. We'll try to improve our heuristics to select that combination by default for those systems. Thank you!
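As a quick illustration of that comparison, here is a small sketch that picks the option with the lowest mean time from the numbers quoted above. The function name `fastest_option` is made up for this example and is unrelated to how the app selects its default.

```python
from statistics import mean


def fastest_option(timings):
    """Return (name, mean_seconds) for the option with the lowest mean time."""
    means = {name: mean(runs) for name, runs in timings.items()}
    best = min(means, key=means.get)
    return best, round(means[best], 2)


# Timings in seconds from the 14" M1 Pro (14 GPU cores, 6 performance cores).
timings = {
    "ANE": [15.2, 15.2, 15.2],
    "GPU": [15.1, 15.1, 15.2],
    "ANE+GPU": [14.4, 14.5, 14.4],
}
print(fastest_option(timings))
```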

pcuenca avatar Feb 26 '23 09:02 pcuenca

Nice computer @Tz-H! We were very interested to see performance on M2 Max, thanks a lot!

pcuenca avatar Feb 26 '23 09:02 pcuenca

Is it possible to report RAM usage as well? It would be interesting to see how RAM is used and how it affects performance.

grapefroot avatar Feb 26 '23 16:02 grapefroot

Hi @grapefroot! Initially I was under the impression that RAM would be an important factor for performance (it is on iOS), but in our tests we did not notice any difference between 8 GB and 16 GB Macs: https://huggingface.co/blog/fast-mac-diffusers#performance-benchmarks. Things could be different if the computer is under memory pressure because other apps are running, but I'm not sure how to test for that scenario. How would you go about measuring RAM usage?

pcuenca avatar Feb 26 '23 17:02 pcuenca

MacBook Pro 14-inch, 2023; Apple M2 Pro, 8-P-Core, 4-E-core, 19-GPU-core; 32GB Memory

  • Model: stable-diffusion-2-base
  • Guidance Scale: 7.5
  • Step count: 25

  • GPU: 11.0s, 11.1s, 11.0s
  • ANE: 10.6s, 10.8s, 10.7s
  • GPU+ANE: 10.5s, 10.4s, 10.7s

Low Power Mode: On

  • GPU: 12.7s, 12.5s, 12.5s
  • ANE: 11.3s, 11.2s, 11.1s
  • GPU+ANE: 10.8s, 11.3s, 11.4s
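To put the Low Power Mode numbers above in perspective, here is a small sketch that computes the relative slowdown per compute unit from the mean times. The helper name `slowdown_percent` is an illustrative assumption.

```python
from statistics import mean


def slowdown_percent(baseline_runs, other_runs):
    """Percentage increase of the mean time of `other_runs` over the baseline."""
    base = mean(baseline_runs)
    return round(100 * (mean(other_runs) - base) / base, 1)


# Timings in seconds from the M2 Pro results above.
normal = {
    "GPU": [11.0, 11.1, 11.0],
    "ANE": [10.6, 10.8, 10.7],
    "GPU+ANE": [10.5, 10.4, 10.7],
}
low_power = {
    "GPU": [12.7, 12.5, 12.5],
    "ANE": [11.3, 11.2, 11.1],
    "GPU+ANE": [10.8, 11.3, 11.4],
}

for unit in normal:
    print(f"{unit}: +{slowdown_percent(normal[unit], low_power[unit])}%")
```

On these numbers the GPU path slows down the most in Low Power Mode, while ANE is the least affected.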

mja avatar Mar 01 '23 14:03 mja

Hi folks, just wanted to throw in a suggestion: I think it would be better to include in this article that all the tests were made using a SPLIT_EINSUM model, since speeds of CPU_AND_GPU with ORIGINAL models are higher.

Source: personal, and with more examples in The Murus Team PromptToImage benchmarks.

Zabriskije avatar Mar 02 '23 01:03 Zabriskije

@Zabriskije the results in our table were done thus:

  • ORIGINAL attention when using compute units CPU_AND_GPU.
  • SPLIT_EINSUM attention for CPU_AND_ANE.

pcuenca avatar Mar 02 '23 07:03 pcuenca

@mja – Super interesting, thanks a lot!

pcuenca avatar Mar 02 '23 07:03 pcuenca

@pcuenca I'm a bit confused: isn't the model downloaded within the Diffusers 1.1 app SPLIT_EINSUM? Aren't the results reported in the article the same as the ones found here? Either way, I think it's still worth pointing out 🤓

Zabriskije avatar Mar 02 '23 10:03 Zabriskije

@Zabriskije We wanted the blog post to be easy, so we decided to hide some details. But yeah, maybe it's worth pointing it out :)

Barring bugs, the way the app is meant to work is:

  • It takes a look at your system and guesses the best compute combination for you. Currently, this yields either CPU+GPU or CPU+ANE.
  • The attention method is coupled with the compute units. GPU implies ORIGINAL, while ANE implies SPLIT_EINSUM.
  • We download the default model (Stable Diffusion 2) according to those decisions. In your case, it looks like it was CPU+ANE, and therefore split_einsum.
  • If you use the Advanced settings and select CPU+GPU instead, then the app tells you that it needs to download a different model (the original attention one), and it does that if you allow it to proceed.

Is this not what's happening in your case?
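The coupling described above can be sketched as a simple lookup. This is an illustration in Python, not the app's Swift code, and the helper name `needs_new_download` is hypothetical.

```python
# Attention implementation implied by each compute-unit choice, as described
# above: GPU implies ORIGINAL, ANE implies SPLIT_EINSUM.
ATTENTION_FOR_COMPUTE_UNITS = {
    "CPU_AND_GPU": "ORIGINAL",
    "CPU_AND_ANE": "SPLIT_EINSUM",
}


def needs_new_download(current_units, requested_units):
    """True if switching compute units requires a different model variant."""
    current = ATTENTION_FOR_COMPUTE_UNITS[current_units]
    requested = ATTENTION_FOR_COMPUTE_UNITS[requested_units]
    return current != requested


# Switching from the ANE default to GPU needs the ORIGINAL attention model.
print(needs_new_download("CPU_AND_ANE", "CPU_AND_GPU"))
```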

pcuenca avatar Mar 02 '23 10:03 pcuenca

@pcuenca Yup, it downloads the ORIGINAL model. Didn't know about that, thanks for the clarification :)

Zabriskije avatar Mar 02 '23 11:03 Zabriskije

Macbook Pro 14" with M2 Pro 12-Core CPU, 19-Core GPU, 32GB Unified Memory

  • Model: stable-diffusion-2-base
  • Guidance Scale: 7.5
  • Step count: 25

  • GPU: 11.4, 11.2, 11.2
  • ANE: 10.3, 10.2, 10.3
  • GPU+ANE: 10.4, 10.3, 10.2

SerCeMan avatar Mar 02 '23 11:03 SerCeMan

Data point on an Intel Mac:

iMac Retina 5K, 2020

  • Processor: 3.6 GHz 10-Core Intel Core i9
  • GPU: AMD Radeon Pro 5700 XT 16 GB

  • Model: stable-diffusion-2-base
  • Guidance Scale: 7.5
  • Step count: 25

  • GPU: 14.9s

pcuenca avatar Mar 05 '23 12:03 pcuenca

Macbook Pro 14" with M2 Max 12-Core CPU, 38-Core GPU, 16-core Neural Engine, 96GB Unified Memory

  • Model: stable-diffusion-2-1-base
  • Guidance Scale: 7.5
  • Step count: 25

  • GPU: 6.5, 6.4, 6.5, 6.6, 6.5
  • ANE: 10.2, 10.3, 10.2, 10.3, 10.2
  • GPU+ANE: 9.9, 9.9, 10.0, 9.8, 10.0

lovelace avatar Mar 08 '23 05:03 lovelace