[Performance] A17 Pro ANE has much more computing power than M2, but its Stable Diffusion performance is worse than M2?

Open AndreaChiChengdu opened this issue 2 years ago • 7 comments

[Image: Stable Diffusion XL benchmark snapshot]

Hello. As the title says, and as the Stable Diffusion XL benchmark snapshot from this project shows, the A17 Pro ANE (35 TOPS) performs worse than the M2 ANE (15.8 TOPS). Is there any reason for this besides the large memory bandwidth gap?

It seems that the A17 Pro's 35 TOPS of compute is not being used effectively at all.

AndreaChiChengdu avatar Oct 16 '23 06:10 AndreaChiChengdu

As a supplement: I used the Diffusers app to run SD 1.5 inference on the ANE of my iPhone 15 Pro (A17 Pro), and the end-to-end times were about the same as on the M2. The A17 Pro also shows very limited improvement over the A16's 17 TOPS ANE (which has the same memory bandwidth). Is this because the workload is memory bound?

I am very confused. How can I find the answer to this question?

AndreaChiChengdu avatar Oct 16 '23 07:10 AndreaChiChengdu
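
One rough way to reason about the memory-bound question is a roofline estimate: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity (operations per byte moved). The sketch below is only illustrative. The peak TOPS values are the ones quoted in this thread; the 100 GB/s figure for the M2 is Apple's stated unified memory bandwidth; the 51.2 GB/s figure for the A17 Pro and the sample arithmetic intensities are assumptions, and the ANE may not see the full SoC bandwidth on either chip.

```python
def attainable_tops(peak_tops, bandwidth_gbs, ops_per_byte):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    # GB/s * ops/byte = Gops/s; divide by 1000 to express it in Tops/s.
    memory_bound_tops = bandwidth_gbs * ops_per_byte / 1000.0
    return min(peak_tops, memory_bound_tops)


# Peak TOPS as quoted in the thread. Bandwidths: M2 is Apple's stated figure,
# A17 Pro is an ASSUMED value that is not confirmed in this thread.
CHIPS = {
    "A17 Pro": {"peak_tops": 35.0, "bandwidth_gbs": 51.2},
    "M2": {"peak_tops": 15.8, "bandwidth_gbs": 100.0},
}


def compare(ops_per_byte):
    """Return the roofline bound for each chip at one arithmetic intensity."""
    return {
        name: attainable_tops(c["peak_tops"], c["bandwidth_gbs"], ops_per_byte)
        for name, c in CHIPS.items()
    }


if __name__ == "__main__":
    # Hypothetical intensities; real values depend on the layer and precision.
    for intensity in (50, 150, 1000):
        print(intensity, compare(intensity))
```

Under these assumed numbers, the A17 Pro only overtakes the M2 once the workload exceeds a few hundred operations per byte; below that, the chip with more bandwidth wins regardless of peak TOPS. Whether the UNet actually sits in that regime would need to be measured.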

https://www.cpu-monkey.com/en/compare_cpu-apple_a17_pro-vs-apple_m2_pro_12_cpu_19_gpu I think you should consider their cores.

TimYao18 avatar Nov 10 '23 09:11 TimYao18

[Image]

The UNet runs on the ANE. As the specifications show, both the A17 Pro and M2 ANEs have 16 cores, yet the A17 Pro, despite being much more powerful on paper (35 TOPS vs. 15.8 TOPS), performs worse. It's incredible. Any suggestions? @TimYao18 @pcuenca

AndreaChiChengdu avatar Nov 10 '23 09:11 AndreaChiChengdu

You cannot look at the "ANE" part alone. The compute unit is "CPU and NE", so maybe the CPU part adds to the M2's score. Or maybe Apple just messed up.

When using CPU + ANE, the CPU also uses a lot of power.

[Screenshot, 2023-11-10 5:42:56 PM]

TimYao18 avatar Nov 10 '23 09:11 TimYao18

The CPU will always consume power; that's not the point.

I encourage you to use the Core ML template in Instruments for further analysis. You will see that almost all of the UNet operators (99.89%) are executed on the ANE. The CPU has only a very small workload, and its latency is also very small compared to that of the ANE computation.

Looking at a finer-grained breakdown, our view is that the UNet's ANE computation time is itself already slightly slower than on the M2. Anyway, thanks for your reply, buddy. Have a great weekend~

AndreaChiChengdu avatar Nov 10 '23 09:11 AndreaChiChengdu

[Screenshot, 2023-11-10 18:24:10]

AndreaChiChengdu avatar Nov 10 '23 10:11 AndreaChiChengdu

Thank you for the information.

I ran into a similar problem with the M2 Pro and M2: the M2 Pro runs slower than the M2, and computeUnit == All runs twice as slow as CPU_AND_NE. Maybe I can use this to check whether something is wrong with the M2 Pro that makes it slower than the M2 when running the UNet.

[Screenshot, 2023-11-10 18:24:10]

TimYao18 avatar Nov 11 '23 01:11 TimYao18
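
For anyone wanting to reproduce the compute-unit comparison on a Mac, here is a minimal sketch of a timing harness. It assumes the coremltools Python package is installed and that you have a converted UNet model package plus a matching input dictionary; the function names `median_latency`, `load_model`, and `compare_units` are hypothetical helpers introduced for illustration, not part of this project.

```python
import statistics
import time


def median_latency(run_once, warmup=2, runs=5):
    """Median wall-clock seconds of run_once() after a few warm-up calls."""
    for _ in range(warmup):
        run_once()  # warm-up calls absorb one-time load and compile costs
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def load_model(path, unit_name):
    """Load a Core ML model pinned to one compute unit (needs coremltools, macOS)."""
    import coremltools as ct  # imported lazily so the timing helper stays stdlib-only

    return ct.models.MLModel(path, compute_units=ct.ComputeUnit[unit_name])


def compare_units(path, inputs, unit_names=("CPU_AND_NE", "ALL")):
    """Time model.predict(inputs) under each compute unit; returns seconds by name."""
    results = {}
    for name in unit_names:
        model = load_model(path, name)
        results[name] = median_latency(lambda: model.predict(inputs))
    return results
```

Calling `compare_units` with the path to your own UNet package and an input dictionary matching its signature returns one median latency per compute unit. A large gap between `ALL` and `CPU_AND_NE` would be consistent with the report above, though it does not by itself explain the cause.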