
How to reduce condition graph computing time?

Open hxgqh opened this issue 1 year ago • 10 comments

Computing the condition graph consumes 34 seconds, while sampling only takes 5.5 s/it.


hxgqh avatar Oct 07 '24 08:10 hxgqh

Flux employs t5xxl, which is relatively heavy, and it is currently only implemented to run on the CPU.

Green-Sky avatar Oct 07 '24 08:10 Green-Sky

@hxgqh Set the number of threads with the -t argument. With a single thread it also takes around 34 seconds on my CPU. With -t 24 (I have a 24-thread CPU), it only takes around 3.5 seconds.
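
For reference, a flux invocation with an explicit thread count might look like this (flag names as in recent stable-diffusion.cpp builds; model paths are placeholders):

    ./sd --diffusion-model flux1-dev.safetensors --vae ae.safetensors \
         --clip_l clip_l.safetensors --t5xxl t5xxl_fp16.safetensors \
         -p "a lovely cat" \
         -t 24   # thread count; defaults to the number of physical cores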

stduhpf avatar Oct 07 '24 11:10 stduhpf

@stduhpf I changed the thread count to 2/4/8/24/48/96, and it doesn't seem to make a difference. Maybe the default thread count is set to the number of CPU cores?

hxgqh avatar Oct 07 '24 12:10 hxgqh

You're right @hxgqh, it uses the number of physical CPU cores by default (so 12 in my case). If I don't set the -t argument at all, it takes 4 seconds with 12 threads.

Then I guess either your CPU is too slow, or you're running out of system memory and it's hitting swap (if that's the case, using a quantized version of t5xxl could help).
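
If quantizing helps, here's a sketch of producing a q4_k t5xxl with sd's convert mode (assuming the -M convert mode and --type flag of recent builds; paths are placeholders):

    ./sd -M convert -m t5xxl_fp16.safetensors -o t5xxl_q4_k.gguf --type q4_k
    # then point the main run at it with: --t5xxl t5xxl_q4_k.gguf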

stduhpf avatar Oct 07 '24 13:10 stduhpf

@Green-Sky Is there any plan to run t5xxl on GPU?

hxgqh avatar Oct 08 '24 09:10 hxgqh

Something seems off with it: for the same t5xxl step on a Ryzen 5600X, the Python CPU implementation (from the diffusers lib) runs 67 seconds (237%) faster than the C++ implementation.

And that's actually the best of the sd-cpp measurements, at 6 threads; when set to 12 (nproc) it becomes another 22 seconds slower:

  • python diffusers (cpu) - 52 seconds
  • sd-cpp 3 threads - 199 seconds
  • sd-cpp 6 threads - 119 seconds
  • sd-cpp 12 (nproc) threads - 141 seconds
  • sd-cpp 16 threads - 147 seconds
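
For anyone who wants to reproduce this kind of sweep, a rough shell loop (hypothetical model paths; time measures the whole run, so read the conditioning time off sd's own log):

    for t in 3 6 12 16; do
        echo "== $t threads =="
        time ./sd --diffusion-model flux1-dev.safetensors --vae ae.safetensors \
                  --clip_l clip_l.safetensors --t5xxl t5xxl_fp16.safetensors \
                  -p "benchmark prompt" -t "$t"
    done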

Off-topic: an inference step takes ~9 seconds for both diffusers on GPU and sd-cpp on GPU; however, to achieve the same quality of result, diffusers takes 4 steps while sd-cpp needs 20. But I guess the latter has something to do with the default parameters.
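
If it is the defaults, the step count at least is tunable; something like this (flag name as in recent builds, other model flags omitted):

    ./sd ... --steps 4   # sd-cpp samples 20 steps by default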

actionless avatar Nov 16 '24 00:11 actionless

I found the solution to my problem with the high condition graph computing time; maybe it will be useful for someone else:

I was building the package on an old Xeon server but running it on a new Ryzen workstation, so apparently some CPU optimizations were disabled during compilation.

After re-compiling it on the workstation itself, the condition graph now computes in 18 seconds in the same test case (34 seconds faster than diffusers-cpu and 100 seconds faster than before, without those optimizations).
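
In other words: build on (or for) the machine that will run it, so ggml can enable the host's instruction sets. Assuming a standard CMake build of stable-diffusion.cpp:

    cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON
    cmake --build build --config Release
    # GGML_NATIVE=ON compiles for the build host's CPU (-march=native),
    # so AVX2/F16C/AVX512 are enabled only if the build machine has them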

actionless avatar Nov 20 '24 03:11 actionless

Just some stats.

  • Compiled with -DGGML_NATIVE=OFF, so it only uses SSE4.2, and run with -t 1 on an Intel i9-10900, it takes 15 minutes for the t5xxl_fp16 step. Without -t (which means 20 threads) it takes 101 seconds.
  • Compiled with -DGGML_NATIVE=ON on an AMD EPYC 7543, so it's only AVX2&F16C-optimized, and run on the i9-10900 without -t (i.e. 20 threads), it takes 11 seconds for the t5 step. The upstream sd binary from the master-10feacf release page gives almost the same speed.
  • The previous build, when run on an even newer Xeon(R) Gold 5420+ (with 8 threads), takes 30 seconds.
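
On Linux, a quick way to check which of these instruction sets the host CPU actually supports:

    lscpu | grep -woE 'sse4_2|avx2|f16c|avx512f' | sort -u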

@stduhpf I'm super curious: what hardware do you have that runs it in just 3.5 seconds on 24 threads, and how do you build sd? Is it built for avx512*?

vt-alt avatar Mar 18 '25 03:03 vt-alt

@vt-alt I don't have anything special, just an R9 5900X (12 cores/24 threads, with AVX2 and F16C) with 32 GB of dual-channel DDR4@3333MHz. I'm thinking memory bandwidth might be a bottleneck for computing the conditioning? (Also, I'm using a q4_k quantization for t5xxl, which can have a big influence.)
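
Back-of-envelope numbers for that hypothesis (parameter count and peak bandwidth are approximate):

    t5xxl ≈ 4.7 B parameters
    fp16 weights ≈ 4.7e9 × 2 bytes ≈ 9.4 GB; q4_k ≈ ~4.5 bits/param ≈ 2.6 GB
    dual-channel DDR4-3333 peak ≈ 2 × 8 bytes × 3333 MT/s ≈ 53 GB/s
    one full pass over the weights: ≈ 0.18 s (fp16) vs ≈ 0.05 s (q4_k)

So the q4_k file moves roughly 3.5× less data through memory, which fits the observation that quantization affects the time a lot.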

stduhpf avatar Mar 18 '25 15:03 stduhpf

Also I'm using a q4_k quantization for t5xxl

That must be it. I can confirm that quantization affects the time a lot.

actionless avatar Mar 18 '25 15:03 actionless