[Bug] Performance regression
Git commit
347710f68f6c6c8e243496957f056a4b9f271d24
Operating System & Version
"Arch"
GGML backends
Vulkan
Command-line arguments used
./sd -M img_gen -p "a cat" --sampling-method euler_a --steps 20 --scheduler gits -W 1024 -H 1024 -b 1 --cfg-scale 5 -s -1 --clip-skip -1 --embd-dir /home/daniandtheweb/Workspace/sd.cpp-webui/models/embeddings/ --lora-model-dir /home/daniandtheweb/Workspace/sd.cpp-webui/models/loras/ -t 0 --rng cuda --sampler-rng cuda --lora-apply-mode auto -o /home/daniandtheweb/Workspace/sd.cpp-webui/outputs/txt2img/1763304506.png --model /home/daniandtheweb/Workspace/sd.cpp-webui/models/checkpoints/plantMilkModelSuite_hempII.safetensors --vae /home/daniandtheweb/Workspace/sd.cpp-webui/models/vae/sdxl_vae_fp16_fix.safetensors --preview proj --preview-path /home/daniandtheweb/Workspace/sd.cpp-webui/outputs/txt2img/1763304506_preview.png --preview-interval 1 --diffusion-fa --vae-conv-direct --color
Steps to reproduce
Run the generation.
What you expected to happen
Performance of about 1.09 s/it
What actually happened
Performance of about 2.47 s/it
Additional context / environment details
I noticed that others have mentioned similar slowdowns in the PR discussion itself, but I think it needs a separate issue so the regression doesn’t get lost and can be tracked properly.
I have reproduced the regression on my end on a Radeon RX 7800 XT, with the LoRA apply mode set to both immediately and at_runtime, with the same results.
I haven’t encountered any performance issues in my testing. Could you provide more details, for example which commit you used for comparison?
I'm directly comparing 59ebdf0bb5b3a6c83d92ca90fd820707fb154e9d (before the regression) with 347710f68f6c6c8e243496957f056a4b9f271d24 (where the regression started). The performance penalty is still present on the current master, 3c1187ce83d21b1e7fe31a7e61a2398e82eecfb2.
To complete the report, I'd like to add that I have only tested this regression with SDXL models; I haven't checked any other model type for possible regressions.
It happens on my card too, with both Vulkan and ROCm.
Average per-step cost for a cfg 6, 20-step generation; the models only have f16 weights.
SDXL 1024x1024:
| version | vulkan | rocm |
|---|---|---|
| 59ebdf0 | 2.76s/it | 1.80s/it |
| 347710f | 4.47s/it | 3.75s/it |
| master-408 | 4.48s/it | 3.74s/it |
SD1.5 1024x1024:
| version | vulkan | rocm |
|---|---|---|
| 59ebdf0 | 2.65s/it | 2.34s/it |
| 347710f | 3.65s/it | 3.44s/it |
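In relative terms (just the ratios of the numbers above, not separate measurements), that is roughly a 1.6x (Vulkan) / 2.1x (ROCm) per-step slowdown for SDXL and about 1.4x / 1.5x for SD1.5.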
[DEBUG] ggml_extend.hpp:66 - ggml_vulkan: 0 = AMD Radeon RX 7600 XT (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
[INFO ] ggml_extend.hpp:69 - ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
[INFO ] ggml_extend.hpp:69 - ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[INFO ] ggml_extend.hpp:69 - ggml_cuda_init: found 1 ROCm devices:
[INFO ] ggml_extend.hpp:69 - Device 0: AMD Radeon RX 7600 XT, gfx1102 (0x1102), VMM: no, Wave Size: 32
Edit: added SD1.5 numbers.
I'm also noticing a slowdown for SD1 512x512, though not nearly as bad (2.97 it/s vs 2.19 it/s).
I tracked the performance regression down to the GEGLU changes: https://github.com/leejet/stable-diffusion.cpp/commit/347710f68f6c6c8e243496957f056a4b9f271d24#diff-815b414bb91f23155827e50a78efdae23e3ed87e63fc47c1b99d2858338f301bL185-R199 Reverting these changes brings the performance back in my case, though it might cause issues with at_runtime LoRAs. (The reverted forward() is sketched after the diff below.)
diff --git a/common.hpp b/common.hpp
index dd8281f..6147677 100644
--- a/common.hpp
+++ b/common.hpp
@@ -182,21 +182,35 @@ protected:
int64_t dim_in;
int64_t dim_out;
+ void init_params(struct ggml_context* ctx, const String2TensorStorage& tensor_storage_map = {}, std::string prefix = "") override {
+ enum ggml_type wtype = get_type(prefix + "proj.weight", tensor_storage_map, GGML_TYPE_F32);
+ enum ggml_type bias_wtype = GGML_TYPE_F32;
+
+ params["proj.weight"] = ggml_new_tensor_2d(ctx, wtype, dim_in, dim_out * 2);
+ params["proj.bias"] = ggml_new_tensor_1d(ctx, bias_wtype, dim_out * 2);
+ }
+
public:
GEGLU(int64_t dim_in, int64_t dim_out)
: dim_in(dim_in), dim_out(dim_out) {
- blocks["proj"] = std::shared_ptr<GGMLBlock>(new Linear(dim_in, dim_out * 2));
}
struct ggml_tensor* forward(GGMLRunnerContext* ctx, struct ggml_tensor* x) override {
// x: [ne3, ne2, ne1, dim_in]
// return: [ne3, ne2, ne1, dim_out]
- auto proj = std::dynamic_pointer_cast<Linear>(blocks["proj"]);
- x = proj->forward(ctx, x); // [ne3, ne2, ne1, dim_out*2]
- auto x_vec = ggml_ext_chunk(ctx->ggml_ctx, x, 2, 0);
- x = x_vec[0]; // [ne3, ne2, ne1, dim_out]
- auto gate = x_vec[1]; // [ne3, ne2, ne1, dim_out]
+ struct ggml_tensor* w = params["proj.weight"];
+ struct ggml_tensor* b = params["proj.bias"];
+
+ auto x_w = ggml_view_2d(ctx->ggml_ctx, w, w->ne[0], w->ne[1] / 2, w->nb[1], 0); // [dim_out, dim_in]
+ auto x_b = ggml_view_1d(ctx->ggml_ctx, b, b->ne[0] / 2, 0); // [dim_out, dim_in]
+ auto gate_w = ggml_view_2d(ctx->ggml_ctx, w, w->ne[0], w->ne[1] / 2, w->nb[1], w->nb[1] * w->ne[1] / 2); // [dim_out, ]
+ auto gate_b = ggml_view_1d(ctx->ggml_ctx, b, b->ne[0] / 2, b->nb[0] * b->ne[0] / 2); // [dim_out, ]
+
+ auto x_in = x;
+
+ x = ggml_ext_linear(ctx->ggml_ctx, x_in, x_w, x_b); // [ne3, ne2, ne1, dim_out]
+ auto gate = ggml_ext_linear(ctx->ggml_ctx, x_in, gate_w, gate_b); // [ne3, ne2, ne1, dim_out]
gate = ggml_gelu_inplace(ctx->ggml_ctx, gate);
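For reference, this is what the reverted forward() looks like with the removed (-) lines above reassembled. It is only a sketch: the final mul and the return come from the surrounding code rather than from this hunk, and the revert also restores the constructor line registering blocks["proj"] as Linear(dim_in, dim_out * 2).

```cpp
// Pre-347710f GEGLU::forward(): one fused Linear projection to dim_out*2,
// then ggml_ext_chunk splits the result into value and gate halves along dim 0.
struct ggml_tensor* forward(GGMLRunnerContext* ctx, struct ggml_tensor* x) override {
    // x: [ne3, ne2, ne1, dim_in]
    // return: [ne3, ne2, ne1, dim_out]
    auto proj = std::dynamic_pointer_cast<Linear>(blocks["proj"]);

    x          = proj->forward(ctx, x);                   // [ne3, ne2, ne1, dim_out*2]
    auto x_vec = ggml_ext_chunk(ctx->ggml_ctx, x, 2, 0);  // two chunks along dim 0
    x          = x_vec[0];                                // [ne3, ne2, ne1, dim_out]
    auto gate  = x_vec[1];                                // [ne3, ne2, ne1, dim_out]

    gate = ggml_gelu_inplace(ctx->ggml_ctx, gate);
    x    = ggml_mul(ctx->ggml_ctx, x, gate);              // [ne3, ne2, ne1, dim_out]
    return x;
}
```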
This simpler patch works just as well and should not break anything:
diff --git a/common.hpp b/common.hpp
index 33d499f..c146d46 100644
--- a/common.hpp
+++ b/common.hpp
@@ -193,11 +193,12 @@ public:
// return: [ne3, ne2, ne1, dim_out]
auto proj = std::dynamic_pointer_cast<Linear>(blocks["proj"]);
- x = proj->forward(ctx, x); // [ne3, ne2, ne1, dim_out*2]
- auto x_vec = ggml_ext_chunk(ctx->ggml_ctx, x, 2, 0);
- x = x_vec[0]; // [ne3, ne2, ne1, dim_out]
- auto gate = x_vec[1]; // [ne3, ne2, ne1, dim_out]
+ x = proj->forward(ctx, x); // [ne3, ne2, ne1, dim_out*2]
+ auto gate = ggml_view_4d(ctx->ggml_ctx, x, dim_out, x->ne[1], x->ne[2], x->ne[3], x->nb[1], x->nb[2], x->nb[3], dim_out * x->nb[0]);
+ x = ggml_view_4d(ctx->ggml_ctx, x, dim_out, x->ne[1], x->ne[2], x->ne[3], x->nb[1], x->nb[2], x->nb[3], 0);
+
+ gate = ggml_cont(ctx->ggml_ctx, gate);
gate = ggml_gelu_inplace(ctx->ggml_ctx, gate);
x = ggml_mul(ctx->ggml_ctx, x, gate); // [ne3, ne2, ne1, dim_out]
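Assembled, the forward() with this patch applied would read roughly as below. This is a sketch, not the exact file contents: the proj projection stays fused as a single matmul, ggml_ext_chunk is replaced by two views into its output (offset dim_out * nb[0] for the gate half), and one ggml_cont makes the gate contiguous before the in-place GELU, as in the diff above; the trailing return is assumed from the surrounding code.

```cpp
struct ggml_tensor* forward(GGMLRunnerContext* ctx, struct ggml_tensor* x) override {
    // x: [ne3, ne2, ne1, dim_in] -> return: [ne3, ne2, ne1, dim_out]
    auto proj = std::dynamic_pointer_cast<Linear>(blocks["proj"]);
    x = proj->forward(ctx, x);  // single fused projection: [ne3, ne2, ne1, dim_out*2]

    // first half of dim 0 is the value, second half is the gate; both are views, no copy yet
    auto gate = ggml_view_4d(ctx->ggml_ctx, x, dim_out, x->ne[1], x->ne[2], x->ne[3],
                             x->nb[1], x->nb[2], x->nb[3], dim_out * x->nb[0]);
    x         = ggml_view_4d(ctx->ggml_ctx, x, dim_out, x->ne[1], x->ne[2], x->ne[3],
                             x->nb[1], x->nb[2], x->nb[3], 0);

    gate = ggml_cont(ctx->ggml_ctx, gate);          // make the gate contiguous before the in-place GELU
    gate = ggml_gelu_inplace(ctx->ggml_ctx, gate);
    x    = ggml_mul(ctx->ggml_ctx, x, gate);        // [ne3, ne2, ne1, dim_out]
    return x;
}
```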
Fixed with #1084