stable-diffusion.cpp
Benchmark?
Can you share how many seconds per step or it/s you get with your hardware (CPU/GPU/RAM)?
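Since replies below mix both units, here is a quick sketch for converting between seconds-per-step and it/s (plain arithmetic, no project code involved):

```python
def it_per_s(sec_per_step):
    """Convert seconds-per-step to iterations-per-second."""
    return 1.0 / sec_per_step

def sec_per_step(iterations_per_second):
    """Convert iterations-per-second back to seconds-per-step."""
    return 1.0 / iterations_per_second

# e.g. 38 s/step (one of the Skylake numbers in this thread) is ~0.026 it/s
print(round(it_per_s(38), 3))  # → 0.026
```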
ah ok, maybe #1
it is rather slow; q8 is the fastest, I guess
sdcpptest.ipynb.txt
With a 256x256px image size, q4_1 took about 8-9 minutes.
it is rather slow; q8 is the fastest, I guess
Currently, it only supports running on the CPU, and the CPU performance on Colab is not very strong, which results in slower processing. I'm working on optimizing CPU performance and adding support for GPU acceleration.
My old Skylake PC takes about 38s per step with the 8-bit model (OpenBLAS doesn't seem to make a difference); the f32 model takes about 40s per step.
My old laptop from 2016 needs 90s per step with the 8-bit model.
Sample picture test on M1 16GB, 5-bit, 512x768, 15 steps, euler a. The picture quality of q5_1 is quite good.
- 16-bit: memory < 3GB, 23 s/step
- 5-bit: memory < 2GB, 22.5 s/step
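As a rough sanity check on numbers like these, per-image sampling time is just steps × s/step (this ignores model loading and VAE decode, so the real wall time is somewhat higher):

```python
def sampling_time(steps, sec_per_step):
    """Rough sampling wall time in seconds, ignoring model load and VAE decode."""
    return steps * sec_per_step

# 15 steps at 22.5 s/step (the 5-bit M1 run above)
print(sampling_time(15, 22.5))  # → 337.5, i.e. about 5.6 minutes
```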
@czkoko are you using the SD 1.5 ggml base model? I think your result is just too good for a base model
@juniofaathir The SD 1.5 base model can't generate such a portrait; I use epicrealism
@czkoko you can use that model?? I've been trying some Civitai models and converting them, but it didn't work, like in #8
@juniofaathir There is no problem for me using it. You can try the model I mentioned or other trained models, and filter out merge models.
@czkoko I was able to convert "Reliberate" but not RealisticVision 5.1 with baked VAE. If the Civitai model has a VAE-free version, you should be able to convert any of them. All major models have a Hugging Face link, which you should prefer over Civitai.
Linking my tests using CUDA acceleration (cuBLAS) here: https://github.com/leejet/stable-diffusion.cpp/issues/6#issuecomment-1679580811
@czkoko you can use that model?? I've been trying some Civitai models and converting them, but it didn't work, like in #8
@juniofaathir Most of the SD 1.x models from Civitai work fine, except for a few that include control-model weights. I'm currently researching how to adapt these models.
@leejet hey, this implementation seems to use a very small amount of RAM, lower and faster than using ONNX f16 models. Thank you for your efforts!
It seems like peak RAM usage stays at a minimum of 1.4GB when generating 256×384 images using the current q4_0 method!
Are you choosing a specific "recipe"?
As explained here: https://huggingface.co/blog/stable-diffusion-xl-coreml
The current composition of the model:
Using these mixed quantization methods seems better than creating distilled models; they can be tailored and optimized for individual models.
Here's something interesting: I almost got a full generation on a 2GB 32-bit mobile phone before running out of RAM. If someone has a better 32-bit ARM device, please see if the generation succeeds.
~/stable-diffusion.cpp $ ./sd -m anything-v3-1-ggml-model-q4_0.bin -W 64 -H 64 -p "frog" --steps 1
WARNING: linker: /data/data/com.termux/files/home/stable-diffusion.cpp/sd: unsupported flags DT_FLAGS_1=0x8000001
[INFO] stable-diffusion.cpp:2191 - loading model from 'anything-v3-1-ggml-model-q4_0.bin'
[INFO] stable-diffusion.cpp:2216 - ftype: q4_0
[INFO] stable-diffusion.cpp:2261 - params ctx size = 1431.26 MB
[INFO] stable-diffusion.cpp:2401 - loading model from 'anything-v3-1-ggml-model-q4_0.bin' completed, taking 21.55s
[INFO] stable-diffusion.cpp:2482 - condition graph use 4.30MB of memory: static 1.37MB, dynamic = 2.93MB
[INFO] stable-diffusion.cpp:2482 - condition graph use 4.30MB of memory: static 1.37MB, dynamic = 2.93MB
[INFO] stable-diffusion.cpp:2824 - get_learned_condition completed, taking 15.42s
[INFO] stable-diffusion.cpp:2832 - start sampling
[INFO] stable-diffusion.cpp:2676 - step 1 sampling completed, taking 180.52s
[INFO] stable-diffusion.cpp:2691 - diffusion graph use 11.46MB of memory: static 2.82MB, dynamic = 8.63MB
[INFO] stable-diffusion.cpp:2837 - sampling completed, taking 180.62s
Killed
~/stable-diffusion.cpp $
Are you choosing a specific "recipe"?
This is determined by the characteristics of the ggml library: quantization can only be applied to the weights of fully connected layers, while the weights of convolutional layers can only be f16.
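The rule described here can be sketched as a per-tensor type selection. This is an illustrative sketch only, with hypothetical tensor names; the real logic lives in the ggml-based conversion code:

```python
def target_type(tensor_name, ndims, quant_type="q4_0"):
    """Pick a storage type per tensor: fully connected (2-D) weights can be
    quantized, convolution (4-D) kernels stay f16, everything else stays f32.
    Hypothetical sketch of the rule described above, not the project's code."""
    if not tensor_name.endswith(".weight"):
        return "f32"        # biases, norm parameters, etc. keep full precision
    if ndims == 4:
        return "f16"        # convolution kernels: ggml cannot quantize these
    return quant_type       # fully connected weights get the chosen quant type

# hypothetical tensor names, for illustration only
print(target_type("unet.attn.to_q.weight", 2))   # → q4_0
print(target_type("unet.conv_in.weight", 4))     # → f16
print(target_type("unet.norm.bias", 1))          # → f32
```

This also explains why the quantized files are not dramatically smaller than f16: the convolutional part of the UNet stays at 16 bits regardless of the chosen quantization type.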
- 60 seconds per step on Asus Zenbook UX430UNR 1.0. 4 threads.
- 30 seconds per step on Thinkpad T14 (AMD; Gen 1). 6 threads.
Tested with q4_0 of the default v1.4 checkpoint.
Here's something interesting: I almost got a full generation on a 2GB 32-bit mobile phone before running out of RAM. If someone has a better 32-bit ARM device, please see if the generation succeeds.
@ClashSAN I used Stable Diffusion v1.5 but in half precision mode (fp16) only. It took around 55 minutes to generate a 512x512 image on my phone (Snapdragon 888 chipset with 8GB RAM).
./bin/sd -m ~/storage/shared/v1-5-pruned-emaonly-ggml-model-f16.bin -p "a lovely cat"
[INFO] stable-diffusion.cpp:2687 - loading model from '/data/data/com.termux/files/home/storage/shared/v1-5-pruned-emaonly-ggml-model-f16.bin'
[INFO] stable-diffusion.cpp:2712 - ftype: f16
[INFO] stable-diffusion.cpp:2941 - total params size = 1969.97MB (clip 235.01MB, unet 1640.45MB, vae 94.51MB)
[INFO] stable-diffusion.cpp:2943 - loading model from '/data/data/com.termux/files/home/storage/shared/v1-5-pruned-emaonly-ggml-model-f16.bin' completed, taking 13.11s
[INFO] stable-diffusion.cpp:3066 - condition graph use 239.58MB of memory: params 235.01MB, runtime 4.57MB (static 1.64MB, dynamic 2.93MB)
[INFO] stable-diffusion.cpp:3066 - condition graph use 239.58MB of memory: params 235.01MB, runtime 4.57MB (static 1.64MB, dynamic 2.93MB)
[INFO] stable-diffusion.cpp:3552 - get_learned_condition completed, taking 3.01s
[INFO] stable-diffusion.cpp:3568 - start sampling
[INFO] stable-diffusion.cpp:3260 - step 1 sampling completed, taking 99.22s
[INFO] stable-diffusion.cpp:3260 - step 2 sampling completed, taking 110.11s
[INFO] stable-diffusion.cpp:3260 - step 3 sampling completed, taking 108.13s
[INFO] stable-diffusion.cpp:3260 - step 4 sampling completed, taking 103.45s
[INFO] stable-diffusion.cpp:3260 - step 5 sampling completed, taking 104.38s
[INFO] stable-diffusion.cpp:3260 - step 6 sampling completed, taking 102.38s
[INFO] stable-diffusion.cpp:3260 - step 7 sampling completed, taking 102.27s
[INFO] stable-diffusion.cpp:3260 - step 8 sampling completed, taking 108.72s
[INFO] stable-diffusion.cpp:3260 - step 9 sampling completed, taking 99.60s
[INFO] stable-diffusion.cpp:3260 - step 10 sampling completed, taking 99.32s
[INFO] stable-diffusion.cpp:3260 - step 11 sampling completed, taking 189.10s
[INFO] stable-diffusion.cpp:3260 - step 12 sampling completed, taking 214.05s
[INFO] stable-diffusion.cpp:3260 - step 13 sampling completed, taking 183.40s
[INFO] stable-diffusion.cpp:3260 - step 14 sampling completed, taking 203.24s
[INFO] stable-diffusion.cpp:3260 - step 15 sampling completed, taking 219.05s
[INFO] stable-diffusion.cpp:3260 - step 16 sampling completed, taking 219.44s
[INFO] stable-diffusion.cpp:3260 - step 17 sampling completed, taking 241.86s
[INFO] stable-diffusion.cpp:3260 - step 18 sampling completed, taking 215.12s
[INFO] stable-diffusion.cpp:3260 - step 19 sampling completed, taking 219.98s
[INFO] stable-diffusion.cpp:3260 - step 20 sampling completed, taking 220.93s
[INFO] stable-diffusion.cpp:3287 - diffusion graph use 2264.22MB of memory: params 1640.45MB, runtime 623.77MB (static 69.56MB, dynamic 554.21MB)
[INFO] stable-diffusion.cpp:3573 - sampling completed, taking 3163.83s
[INFO] stable-diffusion.cpp:3496 - vae graph use 2271.63MB of memory: params 94.51MB, runtime 2177.12MB (static 1153.12MB, dynamic 1024.00MB)
[INFO] stable-diffusion.cpp:3586 - decode_first_stage completed, taking 197.78s
[INFO] stable-diffusion.cpp:3600 - txt2img completed in 3364.61s, use 2358.73MB of memory: peak params memory 1969.97MB, peak runtime memory 2177.12MB
save result image to 'output.png'
The project works well on Android so maybe @leejet wants to update the supported platform list.
AMD Ryzen 7 7700, tested with q8_0 and f16
docker run --rm -v $PWD/models:/models -v $PWD/output/:/output sd --mode txt2img -m /models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "beduin riding a white bear in the desert, high quality, bokeh" -o /output/img2img_output.png
[INFO] stable-diffusion.cpp:3260 - step 20 sampling completed, taking 9.14s
[INFO] stable-diffusion.cpp:3280 - diffusion graph use 2022.78MB of memory: params 1399.01MB, runtime 623.77MB (static 69.56MB, dynamic 554.21MB)
[INFO] stable-diffusion.cpp:3573 - sampling completed, taking 178.27s
[INFO] stable-diffusion.cpp:3489 - vae graph use 2271.63MB of memory: params 94.51MB, runtime 2177.12MB (static 1153.12MB, dynamic 1024.00MB)
[INFO] stable-diffusion.cpp:3586 - decode_first_stage completed, taking 32.42s
[INFO] stable-diffusion.cpp:3594 - txt2img completed in 210.78s, use 2271.63MB of memory: peak params memory 1618.61MB, peak runtime memory 2177.12MB
save result image to '/output/img2img_output.png'
[INFO] stable-diffusion.cpp:3280 - diffusion graph use 2264.22MB of memory: params 1640.45MB, runtime 623.77MB (static 69.56MB, dynamic 554.21MB)
[INFO] stable-diffusion.cpp:3573 - sampling completed, taking 177.67s
[INFO] stable-diffusion.cpp:3489 - vae graph use 2271.63MB of memory: params 94.51MB, runtime 2177.12MB (static 1153.12MB, dynamic 1024.00MB)
[INFO] stable-diffusion.cpp:3586 - decode_first_stage completed, taking 32.74s
[INFO] stable-diffusion.cpp:3594 - txt2img completed in 210.51s, use 2358.73MB of memory: peak params memory 1969.97MB, peak runtime memory 2177.12MB
save result image to '/output/img2img_output.png'
The project works well on Android so maybe @leejet wants to update the supported platform list.
Glad to hear that. I'll update the documentation later.
By the way, I've made a small optimization to make inference faster. I've tested it, and it provides a ~10% speed improvement. Feel free to pull the latest code and give it a try. Just a reminder: don't forget to run the following commands to update the submodule:
git pull origin master
git submodule update
@leejet do I need to run make again?
Yes, you need to run make again.
FYI, GGML is deprecated and replaced by GGUF, so people might want to hold off on creating ggml files in advance :>
https://github.com/ggerganov/llama.cpp
I've created a new benchmark category in the discussion forum and posted some benchmark information. You can also share your benchmark information there if you'd like.
https://github.com/leejet/stable-diffusion.cpp/discussions/categories/benchmark
I'm really digging this project. It's pretty interesting how the timing and memory usage don't really change with precision, unlike llama.cpp, where speed scales roughly linearly with precision (so q8 is about twice as fast as f16). Whether it's f32, f16, q8, or q4, they all take about the same time and memory. I also want to say it's noticeably slower than the OpenVINO version of Stable Diffusion, so there's definitely room for improvement.
The project works well on Android so maybe @leejet wants to update the supported platform list.
You can try mnn-diffusion: on an Android Snapdragon 8 Gen 3 it can reach 2s/iter, and 1s/iter on an Apple M3, with 512x512 images.
Reference: https://zhuanlan.zhihu.com/p/721798565
Source code: https://github.com/alibaba/MNN/tree/master
