
How to reduce condition graph computing time?

Open hxgqh opened this issue 1 year ago • 10 comments

Computing the condition graph consumes 34 seconds, while sampling only takes 5.5 s/it.


hxgqh avatar Oct 07 '24 08:10 hxgqh

Flux employs t5xxl, which is relatively heavy, and it is currently only implemented to run on the CPU.

Green-Sky avatar Oct 07 '24 08:10 Green-Sky

@hxgqh Set the number of threads with the -t argument. With a single thread it also takes around 34 seconds on my CPU. With -t 24 (I have a 24-thread CPU), it only takes around 3.5 seconds.
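
For reference, a flux invocation with an explicit thread count might look like this (flag names as in recent stable-diffusion.cpp builds; model paths are placeholders):

    ./sd --diffusion-model flux1-dev.safetensors --vae ae.safetensors \
         --clip_l clip_l.safetensors --t5xxl t5xxl_fp16.safetensors \
         -p "a lovely cat" \
         -t 24   # thread count; defaults to the number of physical cores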

stduhpf avatar Oct 07 '24 11:10 stduhpf

@stduhpf I changed the thread count to 2/4/8/24/48/96, and it doesn't seem to make a difference. Maybe the default thread count is set to the number of CPU cores?

hxgqh avatar Oct 07 '24 12:10 hxgqh

You're right @hxgqh, it uses the number of physical CPU cores by default (so 12 in my case). If I don't set the -t argument at all, it takes 4 seconds with 12 threads.

Then I guess either your CPU is too slow, or you're running out of system memory and it's hitting swap (if that's the case, using a quantized version of t5xxl could help).
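
If quantizing helps, here's a sketch of producing a q4_k t5xxl with sd's convert mode (assuming the -M convert mode and --type flag of recent builds; paths are placeholders):

    ./sd -M convert -m t5xxl_fp16.safetensors -o t5xxl_q4_k.gguf --type q4_k
    # then point the main run at it with: --t5xxl t5xxl_q4_k.gguf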

stduhpf avatar Oct 07 '24 13:10 stduhpf

@Green-Sky Is there any plan to run t5xxl on GPU?

hxgqh avatar Oct 08 '24 09:10 hxgqh

Something seems off with it: for the same t5xxl step on a Ryzen 5600X, the Python CPU implementation (from the diffusers lib) runs 67 seconds (237%) faster than the C++ implementation.

And that's actually the best of the sd-cpp measurements, at 6 threads; when set to 12 (nproc) it becomes another 22 seconds slower:

  • python diffusers (cpu) - 52 seconds
  • sd-cpp 3 threads - 199 seconds
  • sd-cpp 6 threads - 119 seconds
  • sd-cpp 12 (nproc) threads - 141 seconds
  • sd-cpp 16 threads - 147 seconds
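
For anyone who wants to reproduce this kind of sweep, a rough shell loop (hypothetical model paths; time measures the whole run, so read the conditioning time off sd's own log):

    for t in 3 6 12 16; do
        echo "== $t threads =="
        time ./sd --diffusion-model flux1-dev.safetensors --vae ae.safetensors \
                  --clip_l clip_l.safetensors --t5xxl t5xxl_fp16.safetensors \
                  -p "benchmark prompt" -t "$t"
    done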

Off-topic: an inference step takes ~9 seconds for both diffusers on GPU and sd-cpp on GPU; however, to achieve the same quality of result, diffusers takes 4 steps while sd-cpp needs 20. But I guess the latter has something to do with the default parameters.
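
If it is the defaults, the step count at least is tunable; something like this (flag name as in recent builds, other model flags omitted):

    ./sd ... --steps 4   # sd-cpp samples 20 steps by default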

actionless avatar Nov 16 '24 00:11 actionless

I found the solution to my problem with the high condition graph computing time; maybe it will be useful for someone else:

I was building the package on an old Xeon server but running it on a new Ryzen workstation, so apparently some CPU optimizations were disabled during compilation.

After re-compiling it on the workstation itself, the condition graph now computes in 18 seconds in the same test case (34 seconds faster than diffusers-cpu and 100 seconds faster than before, without those optimizations).
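
In other words: build on (or for) the machine that will run it, so ggml can enable the host's instruction sets. Assuming a standard CMake build of stable-diffusion.cpp:

    cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON
    cmake --build build --config Release
    # GGML_NATIVE=ON compiles for the build host's CPU (-march=native),
    # so AVX2/F16C/AVX512 are enabled only if the build machine has them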

actionless avatar Nov 20 '24 03:11 actionless

Just some stats.

  • Compiled with -DGGML_NATIVE=OFF, so it only uses SSE4.2, and run with -t 1 on an Intel i9-10900, it takes 15 minutes for the t5xxl_fp16 step. Without -t (which means 20 threads) it takes 101 seconds.
  • Compiled with -DGGML_NATIVE=ON on an AMD EPYC 7543, so it's only AVX2&F16C-optimized, and run on the i9-10900 without -t (i.e. 20 threads), it takes 11 seconds for the t5 step. The upstream sd binary from the master-10feacf release page gives almost the same speed.
  • The previous build, when run on an even newer Xeon(R) Gold 5420+ (with 8 threads), takes 30 seconds.
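
On Linux, a quick way to check which of these instruction sets the host CPU actually supports:

    lscpu | grep -woE 'sse4_2|avx2|f16c|avx512f' | sort -u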

@stduhpf I'm super curious: what hardware do you have that runs it in just 3.5 seconds on 24 threads, and how do you build sd? Is it built for avx512*?

vt-alt avatar Mar 18 '25 03:03 vt-alt

@vt-alt I don't have anything special, just an R9 5900X (12 cores/24 threads, with AVX2 and F16C) with 32 GB of dual-channel DDR4@3333MHz. I'm thinking memory bandwidth might be a bottleneck for computing the conditioning? (Also, I'm using a q4_k quantization for t5xxl, which can have a big influence.)
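
Back-of-envelope numbers for that hypothesis (parameter count and peak bandwidth are approximate):

    t5xxl ≈ 4.7 B parameters
    fp16 weights ≈ 4.7e9 × 2 bytes ≈ 9.4 GB; q4_k ≈ ~4.5 bits/param ≈ 2.6 GB
    dual-channel DDR4-3333 peak ≈ 2 × 8 bytes × 3333 MT/s ≈ 53 GB/s
    one full pass over the weights: ≈ 0.18 s (fp16) vs ≈ 0.05 s (q4_k)

So the q4_k file moves roughly 3.5× less data through memory, which fits the observation that quantization affects the time a lot.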

stduhpf avatar Mar 18 '25 15:03 stduhpf

Also I'm using a q4_k quantization for t5xxl

That must be it. I can confirm that quantization affects the time a lot.

actionless avatar Mar 18 '25 15:03 actionless