Collaboration/Sponsorship: Improving SDXL Inference Performance in stable-diffusion.cpp

Open JustMaier opened this issue 4 months ago • 26 comments

We’re exploring stable-diffusion.cpp at Civitai to better serve SDXL requests, but we’ve found inference times still need some work. We’d love to help improve this and are open to sponsoring development or collaborating with contributors here.

@Green-Sky @wbruna @stduhpf - since you’ve done great work on this project, I’d love to hear if you’d be interested in discussing ways to optimize performance together.

We generate millions of images a day, but raw ComfyUI is really inefficient, and even most Python-based solutions have issues... Our aim is to maximize GPU utilization and reduce model swap time by pre-loading weights into VRAM, so that we can maximize throughput.

In our initial tests, the load time is already much better, and we can use a RAM disk to preload models to some degree, but the inference times can be roughly double, which wipes out any gains we got from the improved load times.

We're new to the project, don't have any C++ specialists on the team, and honestly don't have the bandwidth to tackle this ourselves, but we'd love to see it get done and would be happy to chip in.

JustMaier avatar Aug 22 '25 20:08 JustMaier

My two cents:

sdcpp relies on the ggml library, so most performance improvements would need to happen there. For SDXL inference, here’s the list of operations involved:

  • ADD
  • CONCAT
  • CONT
  • DIAG_MASK_INF
  • GET_ROWS
  • GROUP_NORM
  • IM2COL (or CONV2D)
  • MUL
  • MUL_MAT
  • NORM
  • PERMUTE
  • RESHAPE
  • SCALE
  • SOFT_MAX
  • TIMESTEP_EMBEDDING
  • UNARY
  • UPSCALE
  • VIEW

The two critical ops here are MUL_MAT and CONV2D. The rest are mostly limited by memory bandwidth.

Assuming you're using the CUDA backend, you could try forcing cuBLAS for matrix multiplications. However, a direct CONV2D implementation is still missing in this backend, so a custom kernel (with an optional cuDNN path if available) may be needed. A low-hanging fruit would be to apply operator fusion (e.g. GROUP_NORM → MUL → ADD).
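
To make the fusion idea concrete, here is a minimal CPU-side sketch (plain C++, not ggml code; the layout and names are just for illustration) of what collapsing GROUP_NORM → MUL → ADD into a single pass looks like. The point is fewer kernel launches and, more importantly, reading and writing the activation once instead of three times:

```cpp
#include <cmath>
#include <vector>

// Illustrative sketch: one pass over a [C x HW] activation, fusing GroupNorm
// with a per-channel scale (MUL) and shift (ADD) epilogue. An unfused graph
// would launch three kernels and round-trip the full tensor through memory
// between each one.
void fused_group_norm_scale_shift(std::vector<float>& x, int C, int HW,
                                  int n_groups, const std::vector<float>& gamma,
                                  const std::vector<float>& beta, float eps = 1e-6f) {
    const int cpg = C / n_groups;  // channels per group
    for (int g = 0; g < n_groups; ++g) {
        // 1) group statistics
        double sum = 0.0, sq = 0.0;
        for (int c = g * cpg; c < (g + 1) * cpg; ++c)
            for (int i = 0; i < HW; ++i) {
                float v = x[c * HW + i];
                sum += v; sq += double(v) * v;
            }
        const double n    = double(cpg) * HW;
        const float  mean = float(sum / n);
        const float  rstd = 1.0f / std::sqrt(float(sq / n - mean * mean) + eps);
        // 2) normalize + scale + shift in the same sweep (the fused epilogue)
        for (int c = g * cpg; c < (g + 1) * cpg; ++c)
            for (int i = 0; i < HW; ++i)
                x[c * HW + i] = (x[c * HW + i] - mean) * rstd * gamma[c] + beta[c];
    }
}
```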

Flash Attention could also help, but ggml uses its own custom implementation, so it's not directly comparable to PyTorch's version. There are also more hacky techniques (like UNet block caching/reuse, see https://github.com/leejet/stable-diffusion.cpp/pull/705), but that's probably beyond the scope of what you're looking into.

rmatif avatar Aug 22 '25 20:08 rmatif

I would be happy to help with that, but in my case I only have access to a couple of AMD GPUs, so I can't really help optimize for industry-standard CUDA hardware, which I'm assuming is what you're using.

On ROCm/HIP, stable-diffusion.cpp is already about twice as fast as ComfyUI, but that's just because AMD hardware support in ComfyUI is very bad.

stduhpf avatar Aug 22 '25 21:08 stduhpf

I just thought that the least we can do here on the sdcpp side is to refactor GGMLRunner to eventually support multi-GPU inference.

rmatif avatar Aug 22 '25 21:08 rmatif

@rmatif thanks for the tips. I was aware that there's some work being done on the CONV2D side and have seen some of the open PRs related to that: https://github.com/leejet/stable-diffusion.cpp/pull/744

I'll admit I'm not familiar enough with C++ or the C++ ML landscape here to be able to add much value, but I can bring hardware (we've rented hundreds of RTX 4090s), some funds to sponsor the effort, and a lot of demand to test things at scale 🤣

If we could get some time from people here, maybe we could prioritize:

  1. Force cuBLAS for GEMM everywhere it makes sense
  2. Add a real CUDA CONV2D path (maybe a direct conv kernel in ggml's CUDA backend)
  3. Operator fusions
  4. Dig into attention kernels (just heard about SageAttention, supposedly a 3x speedup over flash attention)
  5. Maybe something with UNet block caching like you mentioned.

The thing that really kicked off this effort was the amount of time we spend loading checkpoints and LoRAs... We're really trying to get that load time to zero by preloading models and LoRAs to VRAM prior to job execution. Doing that with Comfy or even PyTorch just didn't work because they don't actually release memory when told to. sd.cpp doesn't have those same memory issues and seems like the ideal candidate to build on top of. However, we can't lose inference time to save on load time, since we only have to load assets about 40% of the time thanks to our orchestration layer.

JustMaier avatar Aug 22 '25 21:08 JustMaier

I'm kind of a smaller fish here, but I'd be happy to help :-)

On the loading-latency side, a low-hanging fruit could be support for reloading only parts of a checkpoint (say, changing the U-Net or LoRAs but keeping the current CLIPs and VAE). Maybe even reusing the common parts across more than one loaded checkpoint (not likely to be a huge gain, but it could become significant at your scale).
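
Just to sketch the "reuse the common parts" idea (hypothetical names and structure, not the current sd.cpp loader): keep already-uploaded components in a small cache keyed by a content hash, so swapping checkpoints only pays for the tensors that actually differ.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical component cache: text encoder / VAE / UNet weights are cached
// by a content hash, so loading a new checkpoint only re-reads and re-uploads
// the components whose tensors actually changed.
struct WeightBlob { /* backend buffer(s) holding one component's tensors */ };

class ComponentCache {
public:
    std::shared_ptr<WeightBlob> get_or_load(
            const std::string& content_hash,
            const std::function<std::shared_ptr<WeightBlob>()>& load) {
        if (auto it = cache_.find(content_hash); it != cache_.end())
            if (auto hit = it->second.lock())
                return hit;          // already resident: reuse, no disk read or upload
        auto blob = load();          // miss: read tensors and upload to the backend
        cache_[content_hash] = blob;
        return blob;
    }
private:
    // weak_ptr so components are freed once no loaded checkpoint references them
    std::unordered_map<std::string, std::weak_ptr<WeightBlob>> cache_;
};
```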

You may also miss some functionality, like a few samplers and a hi-res fix pass. You'll also need a server implementation... or maybe support for using sd.cpp as a custom Comfy node backend.

wbruna avatar Aug 22 '25 23:08 wbruna

I'm personally less of an ML person, but I can do C++, and I can speak to the flash attention stuff that is in ggml.

There is currently still a quality issue I introduced by over-optimistically padding some tensors so that FA can be used; that hasn't improved the speed much and led to some quality problems. There is also the opportunity to instead use a forked ggml and add our tensor shape demands to the flash attention (I would have to look at what SDXL uses here again; I know SD1 could benefit here).

Also, I only have a single 8 GB Nvidia card :)

Green-Sky avatar Aug 23 '25 08:08 Green-Sky

we can't lose inference time to save on load time

I'm really surprised that the difference is actually that big. I already have a dumped SDXL ggml graph and need to do the same with vanilla ComfyUI (I suspect that even the default run has some optimizations applied) to compare and get a clearer view of what can be done.

As for SageAttention, refer to this discussion: https://github.com/ggml-org/llama.cpp/discussions/9901. IMO it would require a great effort across all backends, so I think it's unlikely to happen anytime soon.

There is also the opportunity to instead use a forked ggml and add our tensor shape demands to the flash attention (I would have to look at what SDXL uses here again; I know SD1 could benefit here).

For SDXL the tensor shapes match pretty well

Details
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:1189 - unet compute buffer size: 230.01 MB(RAM)
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:1024 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:1024 L_k:77 n_head:20 C:1280 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:4096 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
[DEBUG] ggml_extend.hpp:866  - attention_ext L_q:4096 L_k:77 n_head:10 C:640 d_head:64 N:1
[DEBUG] ggml_extend.hpp:902  -  uses flash attention
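
As a rough sense of scale for those shapes, here is a back-of-the-envelope FLOP estimate (counting only the two big matmuls per attention call, QKᵀ and softmax(QKᵀ)·V, and ignoring softmax, projections, etc.):

```cpp
#include <cstdio>

// Rough cost of one attention call: ~2 * L_q * L_k * d_head multiply-adds
// for QK^T and the same again for softmax(QK^T) * V, summed over heads.
static double attn_gflop(double Lq, double Lk, double heads, double d) {
    return 2.0 /*ops per MAC*/ * 2.0 /*two matmuls*/ * Lq * Lk * d * heads / 1e9;
}

int main() {
    // shapes taken from the log above
    printf("self-attn  4096x4096, 10 heads, d=64: %5.1f GFLOP\n", attn_gflop(4096, 4096, 10, 64)); // ~42.9
    printf("cross-attn 4096x  77, 10 heads, d=64: %5.2f GFLOP\n", attn_gflop(4096,   77, 10, 64)); // ~0.81
    printf("self-attn  1024x1024, 20 heads, d=64: %5.1f GFLOP\n", attn_gflop(1024, 1024, 20, 64)); // ~5.4
    printf("cross-attn 1024x  77, 20 heads, d=64: %5.2f GFLOP\n", attn_gflop(1024,   77, 20, 64)); // ~0.40
    return 0;
}
```

So the 4096-token self-attention blocks dominate the attention cost by far, which is why FA shape coverage there matters most; the L_k=77 cross-attention calls are comparatively cheap.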

EDIT: I don't have an RTX 4090, but in my testing I got about an 8% difference in sampling speed between sdcpp with FA and cuBLAS versus ComfyUI at default settings. I think we can close the gap with the optimizations I mentioned above.

rmatif avatar Aug 23 '25 10:08 rmatif

@JustMaier I contributed to implementing and improving hardware acceleration support in CUDA (#75). I can say there is still plenty of room for improvement. I was experimenting with ways to boost performance as in PyTorch (flash attention, Winograd for conv2d), but I didn't get the results I expected, probably due to my lack of low-level CUDA knowledge. I was considering using cuDNN to perform the conv2d operations natively, since it achieves higher throughput, but I haven't had the time yet.

FSSRepo avatar Aug 25 '25 06:08 FSSRepo

I contributed a Winograd conv2d operator to ggml (which didn't take off). I also developed the PhotoMaker feature in sd.cpp and have done quite a few performance tests along the way. In my view:

  • the main bottleneck of sd.cpp is the lack of operator fusion (which ggml doesn't support). I profiled sd.cpp using Nsight Compute and saw SCALE taking quite some time, which presumably could be fused as an epilogue of other operators.
  • the conv2d operator using im2col + GEMM is slow and memory-hungry (see the rough numbers sketched below), and should be replaced by other methods. There is already direct convolution support in ggml for some backends (including CUDA), and I am looking into implicit GEMM (need some spare time), which CUTLASS has demonstrated can reach very high FLOPS.
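
For a sense of the "memory hungry" part, here is a back-of-the-envelope calculation for one representative SDXL conv layer (3x3 kernel over a 320-channel, 128x128 feature map, i.e. the highest-resolution UNet stage of a 1024x1024 generation; the shapes are illustrative, not measured from a specific sd.cpp run):

```cpp
#include <cstdio>

int main() {
    // Representative SDXL UNet conv: 3x3 kernel, 320 input channels, 128x128 output.
    const long long IC = 320, OH = 128, OW = 128, KH = 3, KW = 3;
    // im2col materializes IC*KH*KW values for every output pixel before the GEMM
    const long long elems = IC * KH * KW * OH * OW;
    printf("im2col scratch: %lld elements = %.0f MiB (FP16) / %.0f MiB (FP32)\n",
           elems, elems * 2.0 / (1 << 20), elems * 4.0 / (1 << 20));
    // the activation itself is only IC*OH*OW elements, i.e. KH*KW = 9x smaller
    return 0;
}
```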

To further improve the inference speed, one needs to do comprehensive profiling to see which operators need further tuning. I believe some of the operators in ggml's CUDA backend are not optimal.

Personally, I think it would be a pity if sd.cpp cannot compete with a Python-based engine (PyTorch). sd.cpp has high potential to be more performant.

bssrdf avatar Aug 25 '25 16:08 bssrdf

For SDXL the tensor shapes match pretty well

Yea, that's nice (or sad 😄). This already seems to have triggered some upstream fattn improvements for head dim 40 (nice for SD1).

  • the main bottleneck of sd.cpp is the lack of operator fusion (which ggml doesn't support). I profiled sd.cpp using Nsight Compute and saw SCALE taking quite some time, which presumably could be fused as an epilogue of other operators.

There is now upstream support/precedent for operator fusion. https://github.com/ggml-org/llama.cpp/pull/14800 https://github.com/ggml-org/llama.cpp/pull/14907 (and other backends).

Green-Sky avatar Aug 25 '25 17:08 Green-Sky

For SDXL the tensor shapes match pretty well

Yea, that's nice (or sad 😄). This already seems to have triggered some upstream fattn improvements for head dim 40 (nice for SD1).

  • the main bottleneck of sd.cpp is the lack of operator fusion (which ggml doesn't support). I profiled sd.cpp using Nsight Compute and saw SCALE taking quite some time, which presumably could be fused as an epilogue of other operators.

There is now upstream support/precedent for operator fusion. ggml-org/llama.cpp#14800 ggml-org/llama.cpp#14907 (and other backends).

@Green-Sky, thanks for the information. I haven't followed ggml's development for a while and am glad to know that fusion is being supported. sd.cpp's speedup is even more promising.

bssrdf avatar Aug 25 '25 17:08 bssrdf

OK, our team spent a bit of time reviewing performance and testing different builds. At this point there are a few areas we'd like to focus on, and we'd be happy to sponsor any of the following:

  • [ ] cuDNN for conv2d ops
  • [ ] Operator fusion
  • [ ] Potential model loading optimizations (we noticed that even loading from RAM, SDXL models can take 5-6s)
  • [ ] Other potential perf enhancements (Winograd, FA improvements, a SageAttention C++ implementation)

I'd imagine this would likely require a fork of upstream ggml, but maybe there'd be a way to get the changes back into mainline.

To kick things off, we're hoping to get specific individuals to take on specific tasks. We'd be happy to contribute dev environments with 4090s and sponsor individuals here through GitHub or elsewhere if you contact me. If something in the list above seems in your wheelhouse and you want to take a stab at it, just @ me, say what you want to do and when you think it'd be ready for review, and then we can discuss how best to support you.

JustMaier avatar Aug 25 '25 17:08 JustMaier

I gave adding conv2d via cuDNN a try: https://github.com/ggml-org/ggml/commit/43cb16928244b87400bd931b8b9712dec366f290. It works, but it's currently slower than im2col + GEMM (slightly faster in the VAE phase, though) because it doesn't use tensor cores. Attempting a mixed-precision source with the kernel always fails, so I implemented a temporary fallback. I need to dig deeper into the cuDNN documentation. Feel free to test.

rmatif avatar Aug 26 '25 13:08 rmatif

@JustMaier, I'd like to tackle the model loading optimizations. Taking a look at the code, I believe there's room for improvement regardless of the backend, but I'd need to dig deeper and run a few tests.

Could you show the commands and outputs you're using in your tests, and point me to the model files you are testing with, just so we get on the same page? Also, are you testing with the sd executable itself, or loading the library with another program? And how much did the RAM disk improve the loading time, versus hot/cold cache?

(and it may be best to move the specifics of this item to a separate issue?)

wbruna avatar Aug 26 '25 17:08 wbruna

and it may be best to move the specifics of this item to a separate issue?

Or a "discussion" maybe?

stduhpf avatar Aug 26 '25 17:08 stduhpf

it doesn’t use tensor cores.

@rmatif Looking at the cuDNN code you wrote, I see that it is indeed configured to use the tensor cores, with cudnnMathType_t math_type = (i == 0) ? CUDNN_TENSOR_OP_MATH : CUDNN_DEFAULT_MATH;, but even so you say there is a degradation, which is very strange.

FSSRepo avatar Aug 26 '25 20:08 FSSRepo

it doesn’t use tensor cores.

@rmatif Looking at the cuDNN code you wrote, I see that it is indeed configured to use the tensor cores, with cudnnMathType_t math_type = (i == 0) ? CUDNN_TENSOR_OP_MATH : CUDNN_DEFAULT_MATH;, but even so you say there is a degradation, which is very strange.

Yes, it is configured, and I believe it's set up correctly. However, when it comes to an FP16 kernel, it doesn't return an algorithm for this type of operation and falls back to the inefficient FP32 x FP32 path. Updating libcudnn to the latest version helped, though, and now performance is more or less on par with im2col + GEMM.

Could you please test on your end by adding some debug logs to see if it is using tensor cores or if it's just falling back to the FP32 path?
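
Something along these lines could do it (only a sketch: the handle and descriptor arguments are placeholders for whatever the branch already creates, and the mathType reported by cuDNN tells you whether a tensor-core path was actually selected):

```cpp
#include <cstdio>
#include <cudnn.h>

// Hypothetical debug helper: dump the algorithms cuDNN would choose for an
// already configured forward convolution, and whether each one uses
// tensor-op math or falls back to non-tensor-core math.
static void log_conv_algos(cudnnHandle_t handle,
                           cudnnTensorDescriptor_t xDesc, cudnnFilterDescriptor_t wDesc,
                           cudnnConvolutionDescriptor_t convDesc, cudnnTensorDescriptor_t yDesc) {
    cudnnConvolutionFwdAlgoPerf_t perf[8];
    int returned = 0;
    if (cudnnGetConvolutionForwardAlgorithm_v7(handle, xDesc, wDesc, convDesc, yDesc,
                                               8, &returned, perf) != CUDNN_STATUS_SUCCESS) {
        fprintf(stderr, "cudnnGetConvolutionForwardAlgorithm_v7 failed\n");
        return;
    }
    for (int i = 0; i < returned; ++i) {
        fprintf(stderr, "algo=%d status=%d time=%.3fms mem=%zu mathType=%s\n",
                (int)perf[i].algo, (int)perf[i].status, perf[i].time, perf[i].memory,
                perf[i].mathType == CUDNN_TENSOR_OP_MATH ? "TENSOR_OP" : "non-TENSOR_OP");
    }
}
```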

rmatif avatar Aug 26 '25 21:08 rmatif

I believe conv2d is the bottleneck for SDXL. I plan to implement conv2d/conv3d using the implicit GEMM method to reduce memory usage and improve inference speed. However, I'm currently working on adding support for the Wan video model (https://github.com/leejet/stable-diffusion.cpp/pull/778) and don't have the time to do it right now. Once I have added support for Wan and Qwen Image, I might find some time to work on this.

By the way, when using flash attn on a 4090 GPU, the inference speed of the Wan video model in sd.cpp is basically the same as ComfyUI's; sometimes it can even be faster.

leejet avatar Aug 29 '25 15:08 leejet

In fact, there are still many ideas for sd.cpp that I want to implement, including a more user-friendly code framework, performance optimization, and so on. But due to time constraints, I can only say: let's take it one step at a time.

leejet avatar Aug 29 '25 15:08 leejet

I have fixed the use of flash attn in this PR: https://github.com/leejet/stable-diffusion.cpp/pull/778, and now all the attention in SDXL can use flash attn. For generating large images there is indeed room for improvement; however, for generating 512x512 images it is even faster than ComfyUI.

My test device is an RTX 4090.

| Resolution | sd.cpp (--diffusion-fa) | ComfyUI |
|---|---|---|
| 512x512 | 11.30 it/s | 10.99 it/s |
| 768x768 | 6.46 it/s | 9.56 it/s |
| 1024x1024 | 4.96 it/s | 7.57 it/s |

leejet avatar Aug 30 '25 08:08 leejet

I have fixed the use of flash attn in this PR (#778), and now all the attention in SDXL can use flash attn. For generating large images there is indeed room for improvement; however, for generating 512x512 images it is even faster than ComfyUI.

My test device is an RTX 4090.

| Resolution | sd.cpp (--diffusion-fa) | ComfyUI |
|---|---|---|
| 512x512 | 11.30 it/s | 10.99 it/s |
| 768x768 | 6.46 it/s | 9.56 it/s |
| 1024x1024 | 4.96 it/s | 7.57 it/s |

On my conv2d-cudnn branch, I can reach up to 5.42 it/s at 1024×1024, but the gap with ComfyUI is still huge. I think at this resolution all the attention was already using flash attn anyway.

rmatif avatar Aug 30 '25 09:08 rmatif

Honestly, I never noticed any improvement with ggml's Flash Attention kernels. In terms of memory it's noticeable (in SD 1.5 it's only about -1 MB from what I've seen), but in performance it's either worse or negligible in some cases compared to using the native matrix multiplication and softmax kernels. From my tests, the real performance bottleneck is the conv2d matrix multiplication: only about 5 ms are spent on attention, but 80-100 ms on convolution.

FSSRepo avatar Sep 01 '25 06:09 FSSRepo

Honestly, I never noticed any improvement with ggml's Flash Attention kernels. In terms of memory it's noticeable (in SD 1.5 it's only about -1 MB from what I've seen), but in performance it's either worse or negligible in some cases compared to using the native matrix multiplication and softmax kernels. From my tests, the real performance bottleneck is the conv2d matrix multiplication: only about 5 ms are spent on attention, but 80-100 ms on convolution.

Remember, I took your flash attention stuff and made it work here: https://github.com/leejet/stable-diffusion.cpp/pull/386. It includes a modest uplift and memory reduction for SDXL too (especially for larger images).

Green-Sky avatar Sep 01 '25 08:09 Green-Sky

@JustMaier , I opened a discussion about the model loading optimization: #789 .

wbruna avatar Sep 05 '25 00:09 wbruna

I really like optimizations, BUT to be honest, I use sdcpp on a low-end GPU... flash attention 2/3, SageAttention, and MUL_MAT paths other than FMA will break support for all the GPUs older than CDNA and RDNA on the AMD side.

I use ComfyUI on my gfx900 AMD Instinct MI25, and at the moment I have to patch Torch, lots of Python packages, and samplers because of optimizations for newer GPU archs. I know progress has to be made, but... make it fair and optional.

Take Triton 3.4.0: the only thing needed to get it running again on gfx900 is to patch the gfx900 and GCN5 checks back in.

Using https://github.com/leejet/stable-diffusion.cpp/commit/abb115cd021fc2beed826604ed1a479b6a77671c from today:

Using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Pro WX 9100, gfx900:xnack- (0x900), VMM: no, Wave Size: 64

My observations for SDXL 1024x1024: ComfyUI ~3.3 s/it (using opt-sub-quad-attention, which works fine), sd.cpp ~3.3 s/it, so the baseline for me is exactly the same (don't compare a 40€ 16 GB card with a 4090).

Enabling flash attention: ~20 s/it. Enabling conv2d: ~20 s/it, and the VAE goes from ~8 s to ~50 s.

So again, I really would like to see progress and more speed, BUT please make it optional with a switch like -mulmat, -fa2, -fa3, or -sageattention, because there are no kernels for older GPUs.

phil2sat avatar Sep 09 '25 05:09 phil2sat

Some updates:

  • A new implicit conv2d for the CUDA backend has been added; the PR is https://github.com/ggml-org/llama.cpp/pull/15805. Overall, it is much faster than IM2COL and on par with PyTorch/cuDNN aggregated over all input/filter shapes (a rough sketch of the idea follows this list).
  • A prototype of a full-FP16 UNet denoising pipeline has been built. Currently the intermediate layers (FA, MLP, CONV2D, etc.) all output FP32 tensors: even though these ops run internally in FP16, the results need to be converted at the end to feed the next layer. The new pipeline eliminates this conversion entirely.
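
For readers unfamiliar with the term, here is a scalar reference of the implicit-GEMM idea (not the actual PR kernel): the convolution is computed as the same [OH·OW] × [IC·KH·KW] by [IC·KH·KW] × [OC] matrix product that im2col + GEMM performs, but the left-hand matrix is never materialized; its elements are gathered from the input on the fly inside the GEMM loops.

```cpp
#include <vector>

// Scalar reference for implicit-GEMM convolution (NCHW, stride 1, zero padding 'pad').
// "A matrix" row m corresponds to one output pixel and its IC*KH*KW receptive field;
// instead of writing those values to an im2col buffer, they are gathered from 'input'
// inside the GEMM reduction loop.
void conv2d_implicit_gemm(const std::vector<float>& input,  // [IC, IH, IW]
                          const std::vector<float>& weight, // [OC, IC, KH, KW]
                          std::vector<float>& output,       // [OC, OH, OW]
                          int IC, int IH, int IW, int OC, int KH, int KW, int pad) {
    const int OH = IH + 2 * pad - KH + 1, OW = IW + 2 * pad - KW + 1;
    output.assign((size_t)OC * OH * OW, 0.0f);
    const int K = IC * KH * KW;                           // GEMM reduction dimension
    for (int m = 0; m < OH * OW; ++m) {                   // GEMM row = output pixel
        const int oy = m / OW, ox = m % OW;
        for (int oc = 0; oc < OC; ++oc) {                 // GEMM column = output channel
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {                 // reduction, with on-the-fly gather
                const int ic = k / (KH * KW), ky = (k / KW) % KH, kx = k % KW;
                const int iy = oy - pad + ky, ix = ox - pad + kx;
                if (iy < 0 || iy >= IH || ix < 0 || ix >= IW) continue;   // implicit zero pad
                acc += input[(size_t)(ic * IH + iy) * IW + ix] *
                       weight[(size_t)((oc * IC + ic) * KH + ky) * KW + kx];
            }
            output[(size_t)(oc * OH + oy) * OW + ox] = acc;
        }
    }
}
```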

Combining the two above, on a 4090, I can do:

| Resolution | sd.cpp (--diffusion-fa) | sd.cpp (--diffusion-fa, implicit conv2d, fp16 pipeline) | diffusion-fast |
|---|---|---|---|
| 512x512 | 8.94 it/s | 11.56 it/s | ~16 it/s |
| 768x768 | 6.29 it/s | 8 it/s | ~13.5 it/s |
| 1024x1024 | 4.75 it/s | 6.35 it/s | ~8 it/s |

bssrdf avatar Nov 24 '25 15:11 bssrdf