
Benchmark?

grigio opened this issue 2 years ago • 26 comments

Can you share how many seconds per step (or it/s) you get with your hardware (CPU/GPU/RAM)?

grigio avatar Aug 20 '23 22:08 grigio

ah ok, maybe #1

grigio avatar Aug 20 '23 23:08 grigio

It is rather slow; q8 is the fastest, I guess. sdcpptest.ipynb.txt

mjkrakowski avatar Aug 20 '23 23:08 mjkrakowski

With 256x256px image size, q4_1 took about 8-9 minutes.

mjkrakowski avatar Aug 20 '23 23:08 mjkrakowski

it is rather slow, q8 is the fastest i guess

Currently, it only supports running on the CPU. The CPU performance on Colab is not very strong, which results in slower processing. I'm currently working on optimizing its CPU performance and adding support for GPU acceleration.

leejet avatar Aug 21 '23 00:08 leejet

My old Skylake PC takes about 38 s per step with the 8-bit model (OpenBLAS doesn't seem to make a difference), and about 40 s per step with the f32 model.

My old laptop from 2016 needs 90 s per step with the 8-bit model.

h3ndrik avatar Aug 21 '23 14:08 h3ndrik

Sample picture test on M1 16G, 5-bit, 512x768, 15 steps, euler a. The picture quality of q5_1 is quite good.

16-bit: memory < 3 GB, 23 s/step
5-bit: memory < 2 GB, 22.5 s/step

czkoko avatar Aug 21 '23 15:08 czkoko

@czkoko Are you using the SD 1.5 ggml base model? I think your result is just too good for a base model.

juniofaathir avatar Aug 21 '23 18:08 juniofaathir

@juniofaathir The SD 1.5 base model can't generate such a portrait; I use epicrealism.

czkoko avatar Aug 21 '23 18:08 czkoko

@czkoko You can use that model?? I've been trying some civitai models and converting them, but it didn't work, like in #8.

juniofaathir avatar Aug 21 '23 18:08 juniofaathir

@juniofaathir It works without problems for me. You can try the model I mentioned or other trained models, and filter out merged models.

czkoko avatar Aug 21 '23 19:08 czkoko

@czkoko I was able to convert "reliberate" but not realisticvision5.1 with baked VAE. If the civitai model has a VAE-free version, you should be able to convert any of them. All major models have a huggingface link you should prefer over civitai.

mjkrakowski avatar Aug 21 '23 20:08 mjkrakowski

Linking my tests using cuda acceleration (cublas) here https://github.com/leejet/stable-diffusion.cpp/issues/6#issuecomment-1679580811

klosax avatar Aug 21 '23 22:08 klosax

@czkoko you can use that model?? I've been trying some civitai model and converting it, but it didn't work like at #8

@juniofaathir Most of the SD 1.x models from Civitai are working fine, except for a few that include control model weights. I'm currently researching how to adapt these models.

leejet avatar Aug 22 '23 12:08 leejet

@leejet Hey, this implementation seems to use a very low amount of RAM, lower and faster than ONNX f16 models. Thank you for your efforts!

It seems like the peak RAM usage stays at a minimum of 1.4 GB when doing 256×384 images using the current "q4_0" method!

Are you choosing a specific "recipe"?

like explained here: https://huggingface.co/blog/stable-diffusion-xl-coreml

The current composition of the model:

(pie-chart images)

Using these mixed quantization methods seems better than creating distilled models; they can be tailored and optimized for individual models.
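As a rough, back-of-the-envelope illustration of that trade-off, here is a size estimate for one model part under different quantization targets. All numbers here are assumptions for illustration: ~860M UNet parameters for an SD 1.x model, ~90% of them in quantizable fully-connected layers, and ggml's effective bits per weight for each type (q4_0: 18 bytes per 32-element block, q8_0: 34 bytes per block).

```python
# Back-of-the-envelope size of a mixed-quantization UNet (illustrative only).
# Assumption: ~90% of parameters sit in quantizable linear/attention layers,
# the remaining ~10% are conv kernels that stay in f16.
BITS = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}  # effective bits per weight

def mixed_size_mb(n_params: float, quant: str, linear_frac: float = 0.9) -> float:
    """Estimated size in MB when linear weights use `quant` and convs stay f16."""
    linear_bytes = n_params * linear_frac * BITS[quant] / 8
    conv_bytes = n_params * (1 - linear_frac) * BITS["f16"] / 8
    return (linear_bytes + conv_bytes) / 1e6

unet_params = 860e6  # roughly the SD 1.x UNet
for q in ("f16", "q8_0", "q4_0"):
    print(f"{q:5s}: {mixed_size_mb(unet_params, q):7.0f} MB")
```

Note how the f16 conv floor means q4_0 does not shrink the model to a quarter of f16; the conv share puts a lower bound on both size and, likely, speed.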

ClashSAN avatar Aug 22 '23 21:08 ClashSAN

Here's something interesting: I almost got a full generation on a 2gb 32bit mobile phone, before running out of ram. If someone has a better 32bit arm device, please see if the generation is successful.

 ~/stable-diffusion.cpp $ ./sd -m anything-v3-1-ggml-model-q4_0.bin -W 64 -H 64 -p "frog" --steps 1
WARNING: linker: /data/data/com.termux/files/home/stable-diffusion.cpp/sd: unsupported flags DT_FLAGS_1=0x8000001
[INFO]  stable-diffusion.cpp:2191 - loading model from 'anything-v3-1-ggml-model-q4_0.bin'
[INFO]  stable-diffusion.cpp:2216 - ftype: q4_0
[INFO]  stable-diffusion.cpp:2261 - params ctx size =  1431.26 MB
[INFO]  stable-diffusion.cpp:2401 - loading model from 'anything-v3-1-ggml-model-q4_0.bin' completed, taking 21.55s
[INFO]  stable-diffusion.cpp:2482 - condition graph use 4.30MB of memory: static 1.37MB, dynamic = 2.93MB
[INFO]  stable-diffusion.cpp:2482 - condition graph use 4.30MB of memory: static 1.37MB, dynamic = 2.93MB
[INFO]  stable-diffusion.cpp:2824 - get_learned_condition completed, taking 15.42s
[INFO]  stable-diffusion.cpp:2832 - start sampling
[INFO]  stable-diffusion.cpp:2676 - step 1 sampling completed, taking 180.52s
[INFO]  stable-diffusion.cpp:2691 - diffusion graph use 11.46MB of memory: static 2.82MB, dynamic = 8.63MB
[INFO]  stable-diffusion.cpp:2837 - sampling completed, taking 180.62s
Killed
~/stable-diffusion.cpp $

ClashSAN avatar Aug 22 '23 23:08 ClashSAN

Are you choosing a specific "recipe"?

This is determined by the characteristics of the ggml library: quantization can only be applied to the weights of fully connected layers, while the weights of convolutional layers can only be f16.
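That rule can be sketched as a per-tensor type decision. This is a hypothetical illustration, not the actual stable-diffusion.cpp converter code; the tensor names are made up for the example.

```python
# Hypothetical sketch of the per-tensor rule described above: quantize only
# 2-D fully-connected weights, keep 4-D conv kernels in f16, and leave small
# 1-D tensors (biases, norm weights) in f32.
def pick_ggml_type(name: str, shape: tuple, target: str = "q4_0") -> str:
    """Choose a storage type for one checkpoint tensor."""
    if len(shape) == 4:                               # conv kernel: f16 only
        return "f16"
    if len(shape) == 2 and name.endswith(".weight"):  # linear/attention weight
        return target
    return "f32"                                      # biases, norms, ...

print(pick_ggml_type("input_blocks.0.0.weight", (320, 4, 3, 3)))  # f16
print(pick_ggml_type("attn1.to_q.weight", (320, 320)))            # q4_0
```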

leejet avatar Aug 22 '23 23:08 leejet

  • 60 seconds per step on Asus Zenbook UX430UNR 1.0. 4 threads.
  • 30 seconds per step on Thinkpad T14 (AMD; Gen 1). 6 threads.

Tested with q4_0 of default v1.4 checkpoint.

walking-octopus avatar Aug 24 '23 08:08 walking-octopus

Here's something interesting: I almost got a full generation on a 2gb 32bit mobile phone, before running out of ram. If someone has a better 32bit arm device, please see if the generation is successful.

@ClashSAN I used Stable Diffusion v1.5 but in half precision mode (fp16) only. It took around 55 minutes to generate a 512x512 image on my phone (Snapdragon 888 chipset with 8GB RAM).

./bin/sd -m ~/storage/shared/v1-5-pruned-emaonly-ggml-model-f16.bin -p "a lovely cat"
[INFO]  stable-diffusion.cpp:2687 - loading model from '/data/data/com.termux/files/home/storage/shared/v1-5-pruned-emaonly-ggml-model-f16.bin'
[INFO]  stable-diffusion.cpp:2712 - ftype: f16
[INFO]  stable-diffusion.cpp:2941 - total params size = 1969.97MB (clip 235.01MB, unet 1640.45MB, vae 94.51MB)
[INFO]  stable-diffusion.cpp:2943 - loading model from '/data/data/com.termux/files/home/storage/shared/v1-5-pruned-emaonly-ggml-model-f16.bin' completed, taking 13.11s
[INFO]  stable-diffusion.cpp:3066 - condition graph use 239.58MB of memory: params 235.01MB, runtime 4.57MB (static 1.64MB, dynamic 2.93MB)
[INFO]  stable-diffusion.cpp:3066 - condition graph use 239.58MB of memory: params 235.01MB, runtime 4.57MB (static 1.64MB, dynamic 2.93MB)
[INFO]  stable-diffusion.cpp:3552 - get_learned_condition completed, taking 3.01s
[INFO]  stable-diffusion.cpp:3568 - start sampling
[INFO]  stable-diffusion.cpp:3260 - step 1 sampling completed, taking 99.22s
[INFO]  stable-diffusion.cpp:3260 - step 2 sampling completed, taking 110.11s
[INFO]  stable-diffusion.cpp:3260 - step 3 sampling completed, taking 108.13s
[INFO]  stable-diffusion.cpp:3260 - step 4 sampling completed, taking 103.45s
[INFO]  stable-diffusion.cpp:3260 - step 5 sampling completed, taking 104.38s
[INFO]  stable-diffusion.cpp:3260 - step 6 sampling completed, taking 102.38s
[INFO]  stable-diffusion.cpp:3260 - step 7 sampling completed, taking 102.27s
[INFO]  stable-diffusion.cpp:3260 - step 8 sampling completed, taking 108.72s
[INFO]  stable-diffusion.cpp:3260 - step 9 sampling completed, taking 99.60s
[INFO]  stable-diffusion.cpp:3260 - step 10 sampling completed, taking 99.32s
[INFO]  stable-diffusion.cpp:3260 - step 11 sampling completed, taking 189.10s
[INFO]  stable-diffusion.cpp:3260 - step 12 sampling completed, taking 214.05s
[INFO]  stable-diffusion.cpp:3260 - step 13 sampling completed, taking 183.40s
[INFO]  stable-diffusion.cpp:3260 - step 14 sampling completed, taking 203.24s
[INFO]  stable-diffusion.cpp:3260 - step 15 sampling completed, taking 219.05s
[INFO]  stable-diffusion.cpp:3260 - step 16 sampling completed, taking 219.44s
[INFO]  stable-diffusion.cpp:3260 - step 17 sampling completed, taking 241.86s
[INFO]  stable-diffusion.cpp:3260 - step 18 sampling completed, taking 215.12s
[INFO]  stable-diffusion.cpp:3260 - step 19 sampling completed, taking 219.98s
[INFO]  stable-diffusion.cpp:3260 - step 20 sampling completed, taking 220.93s
[INFO]  stable-diffusion.cpp:3287 - diffusion graph use 2264.22MB of memory: params 1640.45MB, runtime 623.77MB (static 69.56MB, dynamic 554.21MB)
[INFO]  stable-diffusion.cpp:3573 - sampling completed, taking 3163.83s
[INFO]  stable-diffusion.cpp:3496 - vae graph use 2271.63MB of memory: params 94.51MB, runtime 2177.12MB (static 1153.12MB, dynamic 1024.00MB)
[INFO]  stable-diffusion.cpp:3586 - decode_first_stage completed, taking 197.78s
[INFO]  stable-diffusion.cpp:3600 - txt2img completed in 3364.61s, use 2358.73MB of memory: peak params memory 1969.97MB, peak runtime memory 2177.12MB
save result image to 'output.png'

output

The project works well on Android so maybe @leejet wants to update the supported platform list.

nviet avatar Aug 24 '23 10:08 nviet

AMD Ryzen 7 7700 test with q8_0 and f16

docker run --rm -v $PWD/models:/models -v $PWD/output/:/output sd --mode txt2img -m /models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "beduin riding a white bear in the desert, high quality, bokeh"  -o /output/img2img_output.png
[INFO]  stable-diffusion.cpp:3260 - step 20 sampling completed, taking 9.14s
[INFO]  stable-diffusion.cpp:3280 - diffusion graph use 2022.78MB of memory: params 1399.01MB, runtime 623.77MB (static 69.56MB, dynamic 554.21MB)
[INFO]  stable-diffusion.cpp:3573 - sampling completed, taking 178.27s
[INFO]  stable-diffusion.cpp:3489 - vae graph use 2271.63MB of memory: params 94.51MB, runtime 2177.12MB (static 1153.12MB, dynamic 1024.00MB)
[INFO]  stable-diffusion.cpp:3586 - decode_first_stage completed, taking 32.42s
[INFO]  stable-diffusion.cpp:3594 - txt2img completed in 210.78s, use 2271.63MB of memory: peak params memory 1618.61MB, peak runtime memory 2177.12MB
save result image to '/output/img2img_output.png'

[INFO]  stable-diffusion.cpp:3280 - diffusion graph use 2264.22MB of memory: params 1640.45MB, runtime 623.77MB (static 69.56MB, dynamic 554.21MB)
[INFO]  stable-diffusion.cpp:3573 - sampling completed, taking 177.67s
[INFO]  stable-diffusion.cpp:3489 - vae graph use 2271.63MB of memory: params 94.51MB, runtime 2177.12MB (static 1153.12MB, dynamic 1024.00MB)
[INFO]  stable-diffusion.cpp:3586 - decode_first_stage completed, taking 32.74s
[INFO]  stable-diffusion.cpp:3594 - txt2img completed in 210.51s, use 2358.73MB of memory: peak params memory 1969.97MB, peak runtime memory 2177.12MB
save result image to '/output/img2img_output.png'


img2img_output

grigio avatar Aug 24 '23 16:08 grigio

The project works well on Android so maybe @leejet wants to update the supported platform list.

Glad to hear that. I'll update the documentation later.

leejet avatar Aug 24 '23 16:08 leejet

By the way, I've made a small optimization to make inference faster. I've tested it and it provides a ~10% speed improvement. Feel free to pull the latest code and give it a try. Just a reminder: don't forget to run the following commands to update the submodule:

git pull origin master
git submodule update

leejet avatar Aug 24 '23 16:08 leejet

@leejet do I need to make again?

juniofaathir avatar Aug 24 '23 23:08 juniofaathir

@leejet do I need to make again?

Yes, you need to run make again.
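Putting the whole update together, the sequence looks roughly like this (a sketch assuming the CMake build described in the README; adjust if you build differently):

```shell
git pull origin master                 # fetch the latest code
git submodule update                   # sync the bundled ggml revision
cmake -B build                         # re-run configure
cmake --build build --config Release   # rebuild the sd binary
```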

leejet avatar Aug 25 '23 00:08 leejet

FYI, the GGML file format is deprecated and replaced by GGUF, so people might want to slow down on creating ggml files in advance :>

https://github.com/ggerganov/llama.cpp

mjkrakowski avatar Aug 25 '23 12:08 mjkrakowski

I've created a new benchmark category in the discussion forum and posted some benchmark information. You can also share your benchmark information there if you'd like.

https://github.com/leejet/stable-diffusion.cpp/discussions/categories/benchmark

leejet avatar Aug 26 '23 10:08 leejet

I'm really digging this project. It's pretty interesting how the timing and memory usage don't really change based on the precision - unlike llama.cpp where speed scales linearly with precision (so q8 is twice as fast as f16). Whether it's f32, f16, q8, q4, they all take about the same time and memory. Also want to say it's noticeably slower than the OpenVINO version of Stable Diffusion. So, there's definitely room for improvement.
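One plausible (hypothetical) explanation for the flat timings: since only the fully-connected weights are quantized and the conv layers stay f16, a sizable share of each step's runtime never benefits from quantization, and Amdahl's law caps the overall gain. The fractions below are made-up numbers for illustration, not measurements.

```python
def overall_speedup(quant_fraction: float, quant_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only `quant_fraction` of the
    runtime is accelerated by a factor of `quant_speedup`."""
    return 1.0 / ((1.0 - quant_fraction) + quant_fraction / quant_speedup)

# If only half the runtime is in quantized matmuls that run 2x faster,
# the end-to-end speedup is just ~1.33x.
print(round(overall_speedup(0.5, 2.0), 2))  # 1.33
```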

RedAndr avatar Sep 27 '23 19:09 RedAndr

Here's something interesting: I almost got a full generation on a 2gb 32bit mobile phone, before running out of ram. If someone has a better 32bit arm device, please see if the generation is successful.

@ClashSAN I used Stable Diffusion v1.5 but in half precision mode (fp16) only. It took around 55 minutes to generate a 512x512 image on my phone (Snapdragon 888 chipset with 8GB RAM).


The project works well on Android so maybe @leejet wants to update the supported platform list.

You can try MNN diffusion; on an Android phone with a Snapdragon 8 Gen 3 it can reach 2 s/iter, and 1 s/iter on an Apple M3, with 512x512 images.

Reference: https://zhuanlan.zhihu.com/p/721798565
Source code: https://github.com/alibaba/MNN/tree/master

bitxsw93 avatar Oct 10 '24 03:10 bitxsw93