
Support for Flux

extremeheat opened this issue 1 year ago · 32 comments

New diffusion model - https://blackforestlabs.ai/announcing-black-forest-labs/

Reference implementation: https://github.com/comfyanonymous/ComfyUI/commit/1589b58d3e29e44623a1f3f595917b98f2301c3e

extremeheat avatar Aug 01 '24 20:08 extremeheat

There's a reference diffusers config in a PR as well: https://huggingface.co/black-forest-labs/FLUX.1-dev/discussions/3/files

65a avatar Aug 02 '24 00:08 65a

That's probably the best open model so far, but it's pretty big. I'd love to be able to use it quantized and with partial GPU offload.

stduhpf avatar Aug 02 '24 16:08 stduhpf

The "schnell" distillation seems like a good candidate though.

That's probably the best open model so far, but it's pretty big. I'd love to be able to use it quantized and with partial GPU offload.

Green-Sky avatar Aug 02 '24 17:08 Green-Sky

12B is massive; quants lower than f16 could become more popular, f8 or even q5 (upd: corrected)

diimdeep avatar Aug 04 '24 13:08 diimdeep

No, 12B parameters at fp32 is 12 * 4 bytes = 48GB of memory, not including clip/t5/vae/etc.:

  • fp16 ≈ 12 * 2 = 24GB
  • 8 bit quant ≈ 12 * 1 = 12GB
  • 4 bit quant ≈ 12 * 0.5 = 6GB
  • 5 bit quant ≈ 12 * (5/8) = 7.5GB

extremeheat avatar Aug 04 '24 19:08 extremeheat

That's probably the best open model so far, but it's pretty big. I'd love to be able to use it quantized and with partial GPU offload.

Can SD.cpp partially offload a model to GPU? I was unable to do this. Can you give a hint how to do this?

  • One more upvote for FLUX support.

red-scorp avatar Aug 06 '24 12:08 red-scorp

No, 12B parameters at fp32 is 12 * 4 bytes = 48GB of memory, not including clip/t5/vae/etc.:

  • fp16 ≈ 12 * 2 = 24GB
  • 8 bit quant ≈ 12 * 1 = 12GB
  • 4 bit quant ≈ 12 * 0.5 = 6GB
  • 5 bit quant ≈ 12 * (5/8) = 7.5GB

Can image generation models be quantized down to 4-5 bits? I saw it done to LLMs with mixed results, but never saw it working with SD and Co.

red-scorp avatar Aug 06 '24 12:08 red-scorp

Can image generation models be quantized down to 4-5 bits? I saw it done to LLMs with mixed results, but never saw it working with SD and Co.

stable-diffusion.cpp can quantize models using the --type command line argument.
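For example, a conversion along these lines should work (flag names are from the README's convert mode, so double-check with sd --help; the file names here are just placeholders):

```sh
# convert a full-precision checkpoint to a q8_0 GGUF with sd.cpp's convert mode
./sd -M convert -m model.safetensors -o model-q8_0.gguf --type q8_0 -v
```

The resulting .gguf can then be loaded with -m like any other model file.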

SkutteOleg avatar Aug 06 '24 12:08 SkutteOleg

The "schnell" distillation seems like a good candidate though.

That's probably the best open model so far, but it's pretty big. I'd love to be able to use it quantized and with partial GPU offload.

Different stuff. It's more like the LCM models for SD: it takes fewer steps to make a picture, with minor quality loss compared to the standard version.

12B is massive; quants lower than f16 could become more popular, f8 or even q5 (upd: corrected)

There's already an unofficial FP8 version which works in ComfyUI: https://huggingface.co/Kijai/flux-fp8

But I don't know if Comfy is able to do inference on more aggressive quantizations.

Anyway... +1 for Flux support, or for any new big image model.

As those models become bigger and bigger, running a quantized version could be the only way to run them on consumer-grade GPUs, and stable-diffusion.cpp could become a good solution.

DGdev91 avatar Aug 06 '24 17:08 DGdev91

Indeed, Flux is amazing; after working with it I want to use it with cpp too if possible.

@FSSRepo if I remember correctly you were working on q2 and the k-quant variants, please work on it if you can; we could quantize it down to something like 3-4 GB maybe. @leejet please add support for LoRA with quantized models if possible; we really need it for Flux so we don't need to download and quantize new models all the time.

Amin456789 avatar Aug 09 '24 13:08 Amin456789

This repo has q4 Flux schnell: https://huggingface.co/city96/FLUX.1-schnell-gguf/tree/main. Agreed, it would be great to get support; my computer is not even close to being able to run it non-quantized.

teddybear082 avatar Aug 17 '24 22:08 teddybear082

There are already quantized files (GGUF) claimed to have been made with stable-diffusion.cpp. The smallest is only 4 GB but does not seem to work:

`gguf_init_from_file: tensor 'double_blocks.0.img_attn.norm.key_norm.scale' of type 10 (q2_K) number of elements (128) is not a multiple of block size (256)`

The Q4_0 model (second smallest) is not working either:

`get sd version from file failed: '../../Downloads/flux1-schnell-Q4_0.gguf'`

https://huggingface.co/aifoundry-org/FLUX.1-schnell-Quantized/tree/main https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main

However, the ability to have a small REST API like llama.cpp's would be amazing for hosting this kind of model!

PierreSnell-Appox avatar Aug 19 '24 20:08 PierreSnell-Appox

Indeed, in the last few weeks lots of developers have made attempts at using a quantized version of Flux in both ComfyUI and Stable Diffusion WebUI Forge. Currently the most popular methods seem to be fp8 and nf4, but I've seen many experiments with GGUF too.

However, the ability to have a small REST API like llama.cpp's would be amazing for hosting this kind of model!

Both ComfyUI and Stable Diffusion WebUI (and of course Forge too) already have an API, and apart from some plugins which try to integrate them with programs like Photoshop or Krita, I haven't seen many projects using them.

It could still be an interesting feature, but I don't think it should be the priority. Also, it could be developed as a separate project.

DGdev91 avatar Aug 20 '24 08:08 DGdev91

The GGUF experiments aren't using proper ggml though. They are just using GGUF as a way to compress the weights, and they are dequantizing on the fly during inference, which is very inefficient.

stduhpf avatar Aug 20 '24 08:08 stduhpf

Hi all!

I am the author of this repo: https://huggingface.co/aifoundry-org/FLUX.1-schnell-Quantized/tree/main. Right now I'm trying to add FLUX support in this forked repo: https://github.com/robolamp/stable-diffusion-flux.cpp, and if it works, I'd like to try to merge it back into stable-diffusion.cpp.

Since SD3 is already supported and FLUX has a similar architecture (as far as I know, at least), I hope it won't be too complicated.

robolamp avatar Aug 20 '24 18:08 robolamp

Flux support has been added. https://github.com/leejet/stable-diffusion.cpp/pull/356

leejet avatar Aug 21 '24 13:08 leejet

Although the architecture is similar to sd3, flux actually has a lot of additional things to implement, so adding flux support took me a bit longer.

leejet avatar Aug 21 '24 13:08 leejet

Although the architecture is similar to sd3, flux actually has a lot of additional things to implement, so adding flux support took me a bit longer.

Hi there, thanks for the Flux support. Has there been a noticeable difference in speed in your tests, compared with the compressed GGUF versions for other UIs?

MGTRIDER avatar Aug 21 '24 16:08 MGTRIDER

Nice! Finally able to run it; I don't have enough VRAM, so I really appreciate it. I imagine it'll probably take forever to generate one image, but it's something at least.

nonetrix avatar Aug 21 '24 19:08 nonetrix

Support has been merged to master, so grab the latest release and give it a spin. See docs/flux.md for how to run it. Also check out https://huggingface.co/Green-Sky/flux.1-schnell-GGUF/ for some prequantized parts; I can recommend q8_0 as basically lossless for the unet as well as t5xxl. The f16 vae (ae) seems to be perceptually lossless too.
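For reference, an invocation roughly like this should work (the file names are placeholders for whatever you downloaded; docs/flux.md has the exact, up-to-date command):

```sh
# txt2img with flux.1-schnell: quantized diffusion model and t5xxl, f16 vae, q8_0 clip_l
./sd --diffusion-model flux1-schnell-q8_0.gguf \
     --vae ae-f16.gguf \
     --clip_l clip_l-q8_0.gguf \
     --t5xxl t5xxl-q8_0.gguf \
     -p "a lovely cat holding a sign that says 'flux.cpp'" \
     --cfg-scale 1.0 --sampling-method euler --steps 4 \
     -H 1024 -W 1024 -o flux_schnell.png -v
```

Schnell is the distilled variant, so around 4 steps with cfg-scale 1.0 is enough.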

Green-Sky avatar Aug 24 '24 08:08 Green-Sky

Is anyone else getting "error: unknown argument: --clip_1"? I'm using sd.exe for CUDA 12 and Windows x64.

EDIT: LOL, it's --clip_l (lowercase L, not a 1)

teddybear082 avatar Aug 24 '24 12:08 teddybear082

This is awesome, thank you!!! Used Green-Sky's schnell q4_k, ae_f16, clip_l_q8, and t5xxl q4_k, with an RTX 3070.

[output image]

teddybear082 avatar Aug 24 '24 13:08 teddybear082

Ok sorry this is probably going to sound stupid, but does running the exe directly each time take longer than some other method? It seems like a lot of the steps do not pertain to processing the specific prompt so I was wondering if there is a way to keep the models “in memory” so to speak in between prompts so that after the first generation, other generations during the same session go faster because they are not trying to do all the steps from scratch each time? I don’t know if I’m making any sense. But I’m comparing to, say, koboldcpp, where all the loading of a llama model happens at first and once it’s loaded which takes some extra time, all generations after that are pretty quick.

teddybear082 avatar Aug 24 '24 14:08 teddybear082

Ok sorry this is probably going to sound stupid, but does running the exe directly each time take longer than some other method? It seems like a lot of the steps do not pertain to processing the specific prompt so I was wondering if there is a way to keep the models “in memory” so to speak in between prompts so that after the first generation, other generations during the same session go faster because they are not trying to do all the steps from scratch each time? I don’t know if I’m making any sense. But I’m comparing to, say, koboldcpp, where all the loading of a llama model happens at first and once it’s loaded which takes some extra time, all generations after that are pretty quick.

I don't think so, sadly. You can do multiple renders with the same prompt by adding the -b [n] argument (replace [n] with the number of images you want). But if you want to use another prompt, you'd have to reload everything.
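For example (placeholder file names, same flags as in docs/flux.md):

```sh
# load the models once and render 4 images from the same prompt
./sd --diffusion-model flux1-schnell-q4_k.gguf --vae ae-f16.gguf \
     --clip_l clip_l-q8_0.gguf --t5xxl t5xxl-q4_k.gguf \
     -p "a cozy cabin in the woods" \
     --cfg-scale 1.0 --sampling-method euler --steps 4 -b 4
```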

stduhpf avatar Aug 24 '24 14:08 stduhpf

Ok sorry this is probably going to sound stupid, but does running the exe directly each time take longer than some other method? It seems like a lot of the steps do not pertain to processing the specific prompt so I was wondering if there is a way to keep the models “in memory” so to speak in between prompts so that after the first generation, other generations during the same session go faster because they are not trying to do all the steps from scratch each time? I don’t know if I’m making any sense. But I’m comparing to, say, koboldcpp, where all the loading of a llama model happens at first and once it’s loaded which takes some extra time, all generations after that are pretty quick.

koboldcpp is a user program built on top of the library llama.cpp. You are looking for a user program built on top of this library, stable-diffusion.cpp. That's not the point of this repo, but someone could build this (or maybe already has?).

0cc4m avatar Aug 25 '24 10:08 0cc4m

kobold.cpp is a user program built on top of this library. It includes a Stable Diffusion UI; it doesn't have as many features as ComfyUI or similar, but it's usable. In addition, you can generate directly from the normal KoboldCpp Lite main UI, or you can do it through most interfaces like SillyTavern.

yggdrasil75 avatar Aug 25 '24 11:08 yggdrasil75

I'm very sorry for my stupid question, but can someone explain why it's so slow when q8_0 or q4_k is used? About 18-19 sec per iteration, while the fp8 model in ComfyUI was giving about 6-7 sec/it on the same GPU (RX 7600 XT). Edit: 1024x1024 resolution.

ghost avatar Aug 28 '24 07:08 ghost

I'm very sorry for my stupid question, but can someone explain why it's so slow when q8_0 or q4_k is used? About 18-19 sec per iteration, while the fp8 model in ComfyUI was giving about 6-7 sec/it on the same GPU (RX 7600 XT). Edit: 1024x1024 resolution.

https://github.com/leejet/stable-diffusion.cpp/issues/323#issuecomment-2298307764

Using stable-diffusion.cpp should be much faster than ComfyUI when it comes to GGUF.

stduhpf avatar Aug 28 '24 09:08 stduhpf

I thought that comment was about Comfy+GGUF, which I didn't try; I tried Comfy with the fp8 model.

ghost avatar Aug 28 '24 11:08 ghost

Ah, I misunderstood what you meant. You're getting worse performance with stable-diffusion.cpp+GGUF compared to Comfy+fp8? Both using ROCm?

stduhpf avatar Aug 28 '24 11:08 stduhpf