stable-diffusion.cpp
Support for Flux
New diffusion model - https://blackforestlabs.ai/announcing-black-forest-labs/
Reference implementation: https://github.com/comfyanonymous/ComfyUI/commit/1589b58d3e29e44623a1f3f595917b98f2301c3e
There's a reference diffusers config in a PR as well: https://huggingface.co/black-forest-labs/FLUX.1-dev/discussions/3/files
That's probably the best open model so far, but it's pretty big. I'd love to be able to use it quantized and with partial GPU offload.
The "schnell" distillation seems like a good candidate though.
12B is massive; lower-than-f16 quants could become more popular, f8 or even q5 (upd: corrected)
No, 12B parameters at fp32 is 12 * 4 bytes = 48GB of memory, not including clip/t5/vae/etc.:
- fp16 ≈ 12 * 2 = 24GB
- 8-bit quant ≈ 12 * 1 = 12GB
- 4-bit quant ≈ 12 * 0.5 = 6GB
- 5-bit quant ≈ 12 * (5/8) = 7.5GB
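The same back-of-the-envelope estimate (parameters in billions * bits per weight / 8 ≈ GB) can be scripted for other bit widths; a minimal sketch, counting only the 12B transformer and ignoring clip/t5/vae:

```sh
# Rough memory estimate for the 12B Flux transformer alone.
# memory_GB ≈ params_in_billions * bits_per_weight / 8
params=12
for bits in 32 16 8 5 4; do
    echo "${bits}-bit: $(echo "$params * $bits / 8" | bc -l) GB"
done
```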
Can SD.cpp partially offload a model to the GPU? I was unable to do this. Can you give a hint on how to do it?
- One more upvote for FLUX support.
Can image generation models be quantized down to 4-5 bits? I saw it done to LLMs with mixed results, but never saw it working with SD and Co.
stable-diffusion.cpp can quantize models using the --type command line argument.
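For example, quantization can be done ahead of time with the convert mode, roughly like this (a sketch: the model path is illustrative, and exact flag names may vary between versions, so check --help):

```sh
# Convert a model to GGUF and quantize it; --type selects the weight type (e.g. q8_0, q5_0, q4_0).
./sd -M convert -m flux1-schnell.safetensors -o flux1-schnell-q4_0.gguf --type q4_0
```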
The "schnell" distillation seems like a good candidate though.
That's probably the best open model so far, but it's pretty big. I'd love to be able to use it quantized and with partial GPU offload.
Different stuff: the "schnell" version is more like the LCM models for SD; it takes fewer steps to make a picture, with minor quality loss compared to the standard version.
There's already an unofficial FP8 version which works in ComfyUI: https://huggingface.co/Kijai/flux-fp8
But I don't know if Comfy is able to do inference on more aggressive quantizations.
Anyway.... +1 for flux support. Or for any new big image model.
As those models become bigger and bigger, running a quantized version could be the only way to run them on consumer-grade GPUs, and stable-diffusion.cpp could become a good solution.
Indeed, flux is amazing; after working with it I want to use it with cpp too if possible.
@FSSRepo if I remember correctly you were working on q2 and the k-quant variants; please keep working on it if you can, we could maybe quantize it down to something like 3-4 GB. @leejet please add support for LoRA with quantized models if possible; we really need it for flux so we don't need to download and quantize new models all the time.
This repo has a q4 flux schnell: https://huggingface.co/city96/FLUX.1-schnell-gguf/tree/main. Agreed, it would be great to get support; my computer is not even close to being able to run the non-quantized version.
There are quantized files already (GGUF) claimed to have been made with stable-diffusion.cpp.
The smallest is only 4GB but does not seem to work (gguf_init_from_file: tensor 'double_blocks.0.img_attn.norm.key_norm.scale' of type 10 (q2_K) number of elements (128) is not a multiple of block size (256)).
The Q4_0 model (second smallest) is not working either: get sd version from file failed: '../../Downloads/flux1-schnell-Q4_0.gguf'
https://huggingface.co/aifoundry-org/FLUX.1-schnell-Quantized/tree/main https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main
However, the ability to have a small REST API like Llama.cpp would be amazing for hosting this kind of model!
Indeed, in the last few weeks lots of developers have made several attempts at using a quantized version of flux in both ComfyUI and Stable Diffusion WebUI Forge. Currently the most popular methods seem to be fp8 and nf4, but I've seen many experiments with gguf too.
Both ComfyUI and Stable Diffusion WebUI (well, of course Forge too) already have an API, and apart from some plugins which try to integrate them with programs like Photoshop or Krita, I haven't seen many projects using them.
It can still be an interesting feature, but I don't think it should be the priority. Also, it could be developed as a separate project.
The GGUF experiments aren't using proper ggml though. They are just using GGUF as a way to compress the weights, and they are dequantizing on the fly during inference, which is very inefficient.
Hi all!
I am the author of this repo: https://huggingface.co/aifoundry-org/FLUX.1-schnell-Quantized/tree/main. Right now I'm trying to add FLUX support in this fork: https://github.com/robolamp/stable-diffusion-flux.cpp and if it works, I'd like to try to merge it back into stable-diffusion.cpp.
Since SD3 is already supported and FLUX has a similar architecture (as far as I know, at least), I hope it won't be too complicated.
Flux support has been added. https://github.com/leejet/stable-diffusion.cpp/pull/356
Although the architecture is similar to sd3, flux actually has a lot of additional things to implement, so adding flux support took me a bit longer.
Hi there, thanks for the flux support. Has there been a noticeable difference in speed in your tests, compared with the compressed GGUF versions for other UIs?
Nice! Finally able to run it; I don't have enough VRAM, so I really appreciate it. I imagine it will probably take forever to generate one image, but it's something at least.
Support has been merged to master, so grab the latest release and give it a spin.
See docs/flux.md for how to run it.
Also check out https://huggingface.co/Green-Sky/flux.1-schnell-GGUF/ for some prequantized parts. I can recommend q8_0 as basically lossless for the unet as well as for t5xxl. The f16 vae (ae) also seems to be perceptually lossless.
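For reference, a txt2img invocation following the pattern in docs/flux.md looks roughly like this (a sketch: the GGUF file names are placeholders for the prequantized parts above):

```sh
# Flux schnell only needs a few steps and cfg-scale 1.
./sd --diffusion-model flux1-schnell-q8_0.gguf --vae ae-f16.gguf \
  --clip_l clip_l-q8_0.gguf --t5xxl t5xxl-q8_0.gguf \
  -p "a lovely cat holding a sign that says 'flux.cpp'" \
  --cfg-scale 1.0 --sampling-method euler --steps 4 -v
```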
Is anyone else getting "error: unknown argument: --clip_1"? I'm using sd.exe for CUDA 12 and Windows x64.
EDIT: LOL, it's --clip_l (a lowercase l, not a 1).
This is awesome, thank you!!! Used Green-Sky's schnell q4_k, ae_f16, clip_l_q8, and t5xxl q4_k, with an RTX 3070.
OK, sorry, this is probably going to sound stupid, but does running the exe directly each time take longer than some other method? It seems like a lot of the steps don't pertain to processing the specific prompt, so I was wondering if there is a way to keep the models "in memory", so to speak, between prompts, so that after the first generation, later generations in the same session go faster because they aren't redoing all the steps from scratch each time. I don't know if I'm making any sense, but I'm comparing to, say, koboldcpp, where all the loading of a llama model happens up front and takes some extra time, and once it's loaded, all generations after that are pretty quick.
I don't think so, sadly. You can do multiple renders with the same prompt by adding the -b [n] argument (replace [n] with the number of images you want), but if you want to use another prompt, you'd have to reload everything.
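For example, appending -b to the run command shown earlier renders four images for one prompt while loading the models only once (same placeholder file names as above):

```sh
./sd --diffusion-model flux1-schnell-q8_0.gguf --vae ae-f16.gguf \
  --clip_l clip_l-q8_0.gguf --t5xxl t5xxl-q8_0.gguf \
  -p "a lovely cat holding a sign that says 'flux.cpp'" \
  --cfg-scale 1.0 --sampling-method euler --steps 4 -b 4
```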
koboldcpp is a user program built on top of the library llama.cpp. You are looking for a user program built on top of this library, stable-diffusion.cpp. That's not the point of this repo, but someone could build this (or maybe already has?).
kobold.cpp is a user program built on top of this library; it includes a stable diffusion UI. It doesn't have as many features as ComfyUI or similar, but it's usable. In addition, you can generate directly from the normal koboldcpp Lite main UI, or you can do it through most interfaces like SillyTavern.
I'm very sorry for my stupid question, but can someone explain why it's so slow when q8_0 or q4_k is used? About 18-19 sec per iteration, while the fp8 model in ComfyUI was giving about 6-7 sec/it on the same GPU (RX 7600 XT). Edit: 1024x1024 resolution.
https://github.com/leejet/stable-diffusion.cpp/issues/323#issuecomment-2298307764
Using stable-diffusion.cpp should be much faster than ComfyUI when it comes to GGUF.
I thought that comment was related to Comfy+GGUF, which I didn't try; I tried Comfy with the fp8 model.
Ah, I misunderstood what you meant. You're getting worse performance with stable-diffusion.cpp+GGUF compared to Comfy+fp8? Both using ROCm?