
[Bug] Performance regression

Open daniandtheweb opened this issue 1 month ago • 7 comments

Git commit

347710f68f6c6c8e243496957f056a4b9f271d24

Operating System & Version

"Arch"

GGML backends

Vulkan

Command-line arguments used

./sd -M img_gen -p "a cat" --sampling-method euler_a --steps 20 --scheduler gits -W 1024 -H 1024 -b 1 --cfg-scale 5 -s -1 --clip-skip -1 --embd-dir /home/daniandtheweb/Workspace/sd.cpp-webui/models/embeddings/ --lora-model-dir /home/daniandtheweb/Workspace/sd.cpp-webui/models/loras/ -t 0 --rng cuda --sampler-rng cuda --lora-apply-mode auto -o /home/daniandtheweb/Workspace/sd.cpp-webui/outputs/txt2img/1763304506.png --model /home/daniandtheweb/Workspace/sd.cpp-webui/models/checkpoints/plantMilkModelSuite_hempII.safetensors --vae /home/daniandtheweb/Workspace/sd.cpp-webui/models/vae/sdxl_vae_fp16_fix.safetensors --preview proj --preview-path /home/daniandtheweb/Workspace/sd.cpp-webui/outputs/txt2img/1763304506_preview.png --preview-interval 1 --diffusion-fa --vae-conv-direct --color

Steps to reproduce

Run the generation.

What you expected to happen

Performance of about 1.09 s/it

What actually happened

Performance of about 2.47 s/it

Additional context / environment details

I noticed that others have mentioned similar slowdowns in the PR discussion itself, but I think it needs a separate issue so the regression doesn’t get lost and can be tracked properly.

I have reproduced the regression on my end on a Radeon RX 7800 XT, with the same results whether the LoRA apply mode is set to immediately or at runtime.

daniandtheweb avatar Nov 17 '25 01:11 daniandtheweb

I haven’t encountered any performance issues in my testing. Could you provide more details, for example the commit you used for comparison?

leejet avatar Nov 30 '25 04:11 leejet

I'm directly comparing 59ebdf0bb5b3a6c83d92ca90fd820707fb154e9d (before the regression) with 347710f68f6c6c8e243496957f056a4b9f271d24 (where the regression started). The performance penalty is still present on the current master, 3c1187ce83d21b1e7fe31a7e61a2398e82eecfb2.

daniandtheweb avatar Nov 30 '25 13:11 daniandtheweb

Just to complete the report: I have only tested this regression with SDXL models; I haven't checked any other models.

daniandtheweb avatar Dec 12 '25 12:12 daniandtheweb

It happens on my card too, with both Vulkan and ROCm.

Average per-step cost for a CFG 6, 20-step generation; the models have f16 weights only.

SDXL 1024x1024:

| version    | Vulkan    | ROCm      |
|------------|-----------|-----------|
| 59ebdf0    | 2.76 s/it | 1.80 s/it |
| 347710f    | 4.47 s/it | 3.75 s/it |
| master-408 | 4.48 s/it | 3.74 s/it |

SD1.5 1024x1024:

| version | Vulkan    | ROCm      |
|---------|-----------|-----------|
| 59ebdf0 | 2.65 s/it | 2.34 s/it |
| 347710f | 3.65 s/it | 3.44 s/it |

[DEBUG] ggml_extend.hpp:66   - ggml_vulkan: 0 = AMD Radeon RX 7600 XT (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

[INFO ] ggml_extend.hpp:69   - ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
[INFO ] ggml_extend.hpp:69   - ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[INFO ] ggml_extend.hpp:69   - ggml_cuda_init: found 1 ROCm devices:
[INFO ] ggml_extend.hpp:69   -   Device 0: AMD Radeon RX 7600 XT, gfx1102 (0x1102), VMM: no, Wave Size: 32

Edit: added SD1.5 numbers.

wbruna avatar Dec 12 '25 15:12 wbruna

I'm also noticing a slowdown for SD1 at 512x512, though not nearly as bad (2.97 it/s vs 2.19 it/s).

stduhpf avatar Dec 12 '25 15:12 stduhpf

I tracked the performance regression down to the GEGLU changes: https://github.com/leejet/stable-diffusion.cpp/commit/347710f68f6c6c8e243496957f056a4b9f271d24#diff-815b414bb91f23155827e50a78efdae23e3ed87e63fc47c1b99d2858338f301bL185-R199

Reverting these changes brings the performance back in my case, though it might cause issues with at_runtime LoRAs.

```diff
diff --git a/common.hpp b/common.hpp
index dd8281f..6147677 100644
--- a/common.hpp
+++ b/common.hpp
@@ -182,21 +182,35 @@ protected:
     int64_t dim_in;
     int64_t dim_out;
 
+    void init_params(struct ggml_context* ctx, const String2TensorStorage& tensor_storage_map = {}, std::string prefix = "") override {
+        enum ggml_type wtype = get_type(prefix + "proj.weight", tensor_storage_map, GGML_TYPE_F32);
+        enum ggml_type bias_wtype = GGML_TYPE_F32;
+
+        params["proj.weight"] = ggml_new_tensor_2d(ctx, wtype, dim_in, dim_out * 2);
+        params["proj.bias"] = ggml_new_tensor_1d(ctx, bias_wtype, dim_out * 2);
+    }
+
 public:
     GEGLU(int64_t dim_in, int64_t dim_out)
         : dim_in(dim_in), dim_out(dim_out) {
-        blocks["proj"] = std::shared_ptr<GGMLBlock>(new Linear(dim_in, dim_out * 2));
     }
 
     struct ggml_tensor* forward(GGMLRunnerContext* ctx, struct ggml_tensor* x) override {
         // x: [ne3, ne2, ne1, dim_in]
         // return: [ne3, ne2, ne1, dim_out]
-        auto proj = std::dynamic_pointer_cast<Linear>(blocks["proj"]);
 
-        x          = proj->forward(ctx, x);  // [ne3, ne2, ne1, dim_out*2]
-        auto x_vec = ggml_ext_chunk(ctx->ggml_ctx, x, 2, 0);
-        x          = x_vec[0];  // [ne3, ne2, ne1, dim_out]
-        auto gate  = x_vec[1];  // [ne3, ne2, ne1, dim_out]
+        struct ggml_tensor* w = params["proj.weight"];
+        struct ggml_tensor* b = params["proj.bias"];
+
+        auto x_w = ggml_view_2d(ctx->ggml_ctx, w, w->ne[0], w->ne[1] / 2, w->nb[1], 0);  // [dim_out, dim_in]
+        auto x_b = ggml_view_1d(ctx->ggml_ctx, b, b->ne[0] / 2, 0);  // [dim_out, dim_in]
+        auto gate_w = ggml_view_2d(ctx->ggml_ctx, w, w->ne[0], w->ne[1] / 2, w->nb[1], w->nb[1] * w->ne[1] / 2);  // [dim_out, ]
+        auto gate_b = ggml_view_1d(ctx->ggml_ctx, b, b->ne[0] / 2, b->nb[0] * b->ne[0] / 2);  // [dim_out, ]
+
+        auto x_in = x;
+
+        x = ggml_ext_linear(ctx->ggml_ctx, x_in, x_w, x_b);  // [ne3, ne2, ne1, dim_out]
+        auto gate = ggml_ext_linear(ctx->ggml_ctx, x_in, gate_w, gate_b);  // [ne3, ne2, ne1, dim_out]
 
         gate = ggml_gelu_inplace(ctx->ggml_ctx, gate);
```
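For reference, the fused and the split formulations are mathematically equivalent. If the fused `proj` weight and bias are written as two halves stacked along the output dimension, $W_{\text{proj}} = [W_1; W_2]$ and $b_{\text{proj}} = [b_1; b_2]$ (which is exactly how the views above slice them), the layer computes

$$
\mathrm{GEGLU}(x) = \bigl(x W_1^{\top} + b_1\bigr) \odot \mathrm{GELU}\!\bigl(x W_2^{\top} + b_2\bigr),
$$

so chunking (or viewing) the output of the fused projection recovers the same two halves. The revert only changes how the work is handed to the backend, not the result.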
stduhpf avatar Dec 12 '25 16:12 stduhpf

This simpler patch works just as well and should not break anything:

```diff
diff --git a/common.hpp b/common.hpp
index 33d499f..c146d46 100644
--- a/common.hpp
+++ b/common.hpp
@@ -193,11 +193,12 @@ public:
         // return: [ne3, ne2, ne1, dim_out]
         auto proj = std::dynamic_pointer_cast<Linear>(blocks["proj"]);
 
-        x          = proj->forward(ctx, x);  // [ne3, ne2, ne1, dim_out*2]
-        auto x_vec = ggml_ext_chunk(ctx->ggml_ctx, x, 2, 0);
-        x          = x_vec[0];  // [ne3, ne2, ne1, dim_out]
-        auto gate  = x_vec[1];  // [ne3, ne2, ne1, dim_out]
+        x = proj->forward(ctx, x);  // [ne3, ne2, ne1, dim_out*2]
 
+        auto gate = ggml_view_4d(ctx->ggml_ctx, x, dim_out, x->ne[1], x->ne[2], x->ne[3], x->nb[1], x->nb[2], x->nb[3], dim_out * x->nb[0]);
+        x         = ggml_view_4d(ctx->ggml_ctx, x, dim_out, x->ne[1], x->ne[2], x->ne[3], x->nb[1], x->nb[2], x->nb[3], 0);
+
+        gate = ggml_cont(ctx->ggml_ctx, gate);
         gate = ggml_gelu_inplace(ctx->ggml_ctx, gate);
 
         x = ggml_mul(ctx->ggml_ctx, x, gate);  // [ne3, ne2, ne1, dim_out]
```
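For anyone who wants to poke at the view-based split outside the full pipeline, here is a minimal standalone CPU-only sketch against the core ggml API (not part of stable-diffusion.cpp), assuming a recent ggml checkout where the CPU helpers are declared in `ggml-cpu.h`; the tensor sizes are arbitrary. It mirrors the shape of the patch above: view the two halves of a fused projection output, make the gate contiguous, apply GELU, and multiply.

```cpp
// Standalone sketch of the view + cont split used in the patch above.
// Assumes a recent ggml checkout (ggml.h + ggml-cpu.h); sizes are made up.
#include "ggml.h"
#include "ggml-cpu.h"
#include <cstdio>

int main() {
    const int64_t dim_out = 4;
    const int64_t n_tok   = 3;

    struct ggml_init_params ip = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context* ctx = ggml_init(ip);

    // stand-in for the fused projection output: ne0 = 2*dim_out, ne1 = n_tok
    struct ggml_tensor* proj_out = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2 * dim_out, n_tok);
    float* data = (float*)proj_out->data;
    for (int64_t i = 0; i < ggml_nelements(proj_out); i++) {
        data[i] = 0.1f * (float)i;
    }

    // first half of every row -> x, second half -> gate (byte offsets, like the patch)
    struct ggml_tensor* x    = ggml_view_2d(ctx, proj_out, dim_out, n_tok, proj_out->nb[1], 0);
    struct ggml_tensor* gate = ggml_view_2d(ctx, proj_out, dim_out, n_tok, proj_out->nb[1],
                                            dim_out * proj_out->nb[0]);

    // the gate view is strided, so make it contiguous before GELU
    gate = ggml_cont(ctx, gate);
    gate = ggml_gelu(ctx, gate);

    struct ggml_tensor* out = ggml_mul(ctx, x, gate);  // [dim_out, n_tok]

    struct ggml_cgraph* gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    printf("out[0][0] = %f\n", ((float*)out->data)[0]);

    ggml_free(ctx);
    return 0;
}
```

The two candidate fixes differ only in where the split happens: the revert splits the weight before the matmul, while this simpler patch splits the matmul output afterwards with views.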
stduhpf avatar Dec 12 '25 16:12 stduhpf

Fixed with #1084

daniandtheweb avatar Dec 16 '25 16:12 daniandtheweb