[Bug] Performance regression
Git commit
347710f68f6c6c8e243496957f056a4b9f271d24
Operating System & Version
"Arch"
GGML backends
Vulkan
Command-line arguments used
./sd -M img_gen -p "a cat" --sampling-method euler_a --steps 20 --scheduler gits -W 1024 -H 1024 -b 1 --cfg-scale 5 -s -1 --clip-skip -1 --embd-dir /home/daniandtheweb/Workspace/sd.cpp-webui/models/embeddings/ --lora-model-dir /home/daniandtheweb/Workspace/sd.cpp-webui/models/loras/ -t 0 --rng cuda --sampler-rng cuda --lora-apply-mode auto -o /home/daniandtheweb/Workspace/sd.cpp-webui/outputs/txt2img/1763304506.png --model /home/daniandtheweb/Workspace/sd.cpp-webui/models/checkpoints/plantMilkModelSuite_hempII.safetensors --vae /home/daniandtheweb/Workspace/sd.cpp-webui/models/vae/sdxl_vae_fp16_fix.safetensors --preview proj --preview-path /home/daniandtheweb/Workspace/sd.cpp-webui/outputs/txt2img/1763304506_preview.png --preview-interval 1 --diffusion-fa --vae-conv-direct --color
Steps to reproduce
Run the generation.
What you expected to happen
Performance of about 1.09 s/it
What actually happened
Performance of about 2.47 s/it
Additional context / environment details
I noticed that others have mentioned similar slowdowns in the PR discussion itself, but I think it needs a separate issue so the regression doesn’t get lost and can be tracked properly.
I have reproduced the regression on my end on a Radeon RX 7800 XT, with the LoRA apply mode set to both immediately and at_runtime, with the same results.
I haven’t encountered any performance issues in my testing. Could you provide more details, for example which commit you used for comparison?
I'm directly comparing 59ebdf0bb5b3a6c83d92ca90fd820707fb154e9d (before the regression) with 347710f68f6c6c8e243496957f056a4b9f271d24 (where the regression started). The performance penalty is still present on the current master, 3c1187ce83d21b1e7fe31a7e61a2398e82eecfb2.
To complete the report, I'd like to add that I have only tested this regression with SDXL models; I haven't checked any other model type for possible regressions.
It happens on my card too, with both Vulkan and ROCm.
Average per-step cost for a cfg 6, 20-step generation; the models only have f16 weights.
SDXL 1024x1024:
| version | vulkan | rocm |
|---|---|---|
| 59ebdf0 | 2.76s/it | 1.80s/it |
| 347710f | 4.47s/it | 3.75s/it |
| master-408 | 4.48s/it | 3.74s/it |
SD1.5 1024x1024:
| version | vulkan | rocm |
|---|---|---|
| 59ebdf0 | 2.65s/it | 2.34s/it |
| 347710f | 3.65s/it | 3.44s/it |
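In relative terms (just the ratios of the numbers above, not separate measurements), that is roughly a 1.6x (Vulkan) / 2.1x (ROCm) per-step slowdown for SDXL and about 1.4x / 1.5x for SD1.5.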
[DEBUG] ggml_extend.hpp:66 - ggml_vulkan: 0 = AMD Radeon RX 7600 XT (RADV NAVI33) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
[INFO ] ggml_extend.hpp:69 - ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
[INFO ] ggml_extend.hpp:69 - ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[INFO ] ggml_extend.hpp:69 - ggml_cuda_init: found 1 ROCm devices:
[INFO ] ggml_extend.hpp:69 - Device 0: AMD Radeon RX 7600 XT, gfx1102 (0x1102), VMM: no, Wave Size: 32
Edit: added SD1.5 numbers.
I'm also noticing a slowdown for SD1 512x512, though not nearly as bad (2.97 it/s vs 2.19 it/s).
I tracked the performance regression down to the GEGLU changes: https://github.com/leejet/stable-diffusion.cpp/commit/347710f68f6c6c8e243496957f056a4b9f271d24#diff-815b414bb91f23155827e50a78efdae23e3ed87e63fc47c1b99d2858338f301bL185-R199 Reverting these changes brings the performance back in my case, though it might cause issues with at_runtime LoRAs. (The reverted forward() is sketched after the diff below.)
diff --git a/common.hpp b/common.hpp
index dd8281f..6147677 100644
--- a/common.hpp
+++ b/common.hpp
@@ -182,21 +182,35 @@ protected:
int64_t dim_in;
int64_t dim_out;
+ void init_params(struct ggml_context* ctx, const String2TensorStorage& tensor_storage_map = {}, std::string prefix = "") override {
+ enum ggml_type wtype = get_type(prefix + "proj.weight", tensor_storage_map, GGML_TYPE_F32);
+ enum ggml_type bias_wtype = GGML_TYPE_F32;
+
+ params["proj.weight"] = ggml_new_tensor_2d(ctx, wtype, dim_in, dim_out * 2);
+ params["proj.bias"] = ggml_new_tensor_1d(ctx, bias_wtype, dim_out * 2);
+ }
+
public:
GEGLU(int64_t dim_in, int64_t dim_out)
: dim_in(dim_in), dim_out(dim_out) {
- blocks["proj"] = std::shared_ptr<GGMLBlock>(new Linear(dim_in, dim_out * 2));
}
struct ggml_tensor* forward(GGMLRunnerContext* ctx, struct ggml_tensor* x) override {
// x: [ne3, ne2, ne1, dim_in]
// return: [ne3, ne2, ne1, dim_out]
- auto proj = std::dynamic_pointer_cast<Linear>(blocks["proj"]);
- x = proj->forward(ctx, x); // [ne3, ne2, ne1, dim_out*2]
- auto x_vec = ggml_ext_chunk(ctx->ggml_ctx, x, 2, 0);
- x = x_vec[0]; // [ne3, ne2, ne1, dim_out]
- auto gate = x_vec[1]; // [ne3, ne2, ne1, dim_out]
+ struct ggml_tensor* w = params["proj.weight"];
+ struct ggml_tensor* b = params["proj.bias"];
+
+ auto x_w = ggml_view_2d(ctx->ggml_ctx, w, w->ne[0], w->ne[1] / 2, w->nb[1], 0); // [dim_out, dim_in]
+ auto x_b = ggml_view_1d(ctx->ggml_ctx, b, b->ne[0] / 2, 0); // [dim_out, dim_in]
+ auto gate_w = ggml_view_2d(ctx->ggml_ctx, w, w->ne[0], w->ne[1] / 2, w->nb[1], w->nb[1] * w->ne[1] / 2); // [dim_out, ]
+ auto gate_b = ggml_view_1d(ctx->ggml_ctx, b, b->ne[0] / 2, b->nb[0] * b->ne[0] / 2); // [dim_out, ]
+
+ auto x_in = x;
+
+ x = ggml_ext_linear(ctx->ggml_ctx, x_in, x_w, x_b); // [ne3, ne2, ne1, dim_out]
+ auto gate = ggml_ext_linear(ctx->ggml_ctx, x_in, gate_w, gate_b); // [ne3, ne2, ne1, dim_out]
gate = ggml_gelu_inplace(ctx->ggml_ctx, gate);
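For reference, this is what the reverted forward() looks like with the removed (-) lines above reassembled. It is only a sketch: the final mul and the return come from the surrounding code rather than from this hunk, and the revert also restores the constructor line registering blocks["proj"] as Linear(dim_in, dim_out * 2).

```cpp
// Pre-347710f GEGLU::forward(): one fused Linear projection to dim_out*2,
// then ggml_ext_chunk splits the result into value and gate halves along dim 0.
struct ggml_tensor* forward(GGMLRunnerContext* ctx, struct ggml_tensor* x) override {
    // x: [ne3, ne2, ne1, dim_in]
    // return: [ne3, ne2, ne1, dim_out]
    auto proj = std::dynamic_pointer_cast<Linear>(blocks["proj"]);

    x          = proj->forward(ctx, x);                   // [ne3, ne2, ne1, dim_out*2]
    auto x_vec = ggml_ext_chunk(ctx->ggml_ctx, x, 2, 0);  // two chunks along dim 0
    x          = x_vec[0];                                // [ne3, ne2, ne1, dim_out]
    auto gate  = x_vec[1];                                // [ne3, ne2, ne1, dim_out]

    gate = ggml_gelu_inplace(ctx->ggml_ctx, gate);
    x    = ggml_mul(ctx->ggml_ctx, x, gate);              // [ne3, ne2, ne1, dim_out]
    return x;
}
```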
This simpler patch works just as well and should not break anything:
diff --git a/common.hpp b/common.hpp
index 33d499f..c146d46 100644
--- a/common.hpp
+++ b/common.hpp
@@ -193,11 +193,12 @@ public:
// return: [ne3, ne2, ne1, dim_out]
auto proj = std::dynamic_pointer_cast<Linear>(blocks["proj"]);
- x = proj->forward(ctx, x); // [ne3, ne2, ne1, dim_out*2]
- auto x_vec = ggml_ext_chunk(ctx->ggml_ctx, x, 2, 0);
- x = x_vec[0]; // [ne3, ne2, ne1, dim_out]
- auto gate = x_vec[1]; // [ne3, ne2, ne1, dim_out]
+ x = proj->forward(ctx, x); // [ne3, ne2, ne1, dim_out*2]
+ auto gate = ggml_view_4d(ctx->ggml_ctx, x, dim_out, x->ne[1], x->ne[2], x->ne[3], x->nb[1], x->nb[2], x->nb[3], dim_out * x->nb[0]);
+ x = ggml_view_4d(ctx->ggml_ctx, x, dim_out, x->ne[1], x->ne[2], x->ne[3], x->nb[1], x->nb[2], x->nb[3], 0);
+
+ gate = ggml_cont(ctx->ggml_ctx, gate);
gate = ggml_gelu_inplace(ctx->ggml_ctx, gate);
x = ggml_mul(ctx->ggml_ctx, x, gate); // [ne3, ne2, ne1, dim_out]
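Assembled, the forward() with this patch applied would read roughly as below. This is a sketch, not the exact file contents: the proj projection stays fused as a single matmul, ggml_ext_chunk is replaced by two views into its output (offset dim_out * nb[0] for the gate half), and one ggml_cont makes the gate contiguous before the in-place GELU, as in the diff above; the trailing return is assumed from the surrounding code.

```cpp
struct ggml_tensor* forward(GGMLRunnerContext* ctx, struct ggml_tensor* x) override {
    // x: [ne3, ne2, ne1, dim_in] -> return: [ne3, ne2, ne1, dim_out]
    auto proj = std::dynamic_pointer_cast<Linear>(blocks["proj"]);
    x = proj->forward(ctx, x);  // single fused projection: [ne3, ne2, ne1, dim_out*2]

    // first half of dim 0 is the value, second half is the gate; both are views, no copy yet
    auto gate = ggml_view_4d(ctx->ggml_ctx, x, dim_out, x->ne[1], x->ne[2], x->ne[3],
                             x->nb[1], x->nb[2], x->nb[3], dim_out * x->nb[0]);
    x         = ggml_view_4d(ctx->ggml_ctx, x, dim_out, x->ne[1], x->ne[2], x->ne[3],
                             x->nb[1], x->nb[2], x->nb[3], 0);

    gate = ggml_cont(ctx->ggml_ctx, gate);          // make the gate contiguous before the in-place GELU
    gate = ggml_gelu_inplace(ctx->ggml_ctx, gate);
    x    = ggml_mul(ctx->ggml_ctx, x, gate);        // [ne3, ne2, ne1, dim_out]
    return x;
}
```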
Fixed with #1084