How to reduce condition graph computing time?
It takes 34 seconds, while sampling only takes 5.5 s/it.
Flux uses t5xxl, which is relatively heavy, and it is currently only implemented to run on the CPU.
@hxgqh Set the number of threads with the `-t` argument.
With a single thread, it also takes around 34 seconds on my CPU.
With `-t 24` (I have a 24-thread CPU), it only takes around 3.5 seconds.
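For reference, a minimal sketch of where `-t` fits into a Flux invocation (the model file names are placeholders, and the other flags just mirror the usual Flux example from the repo docs):

```sh
# -t sets the number of CPU threads; the t5xxl conditioning step runs on the
# CPU, so it benefits directly from matching -t to your core/thread count.
./bin/sd --diffusion-model flux1-dev-q8_0.gguf \
         --clip_l clip_l.safetensors \
         --t5xxl t5xxl_fp16.safetensors \
         --vae ae.safetensors \
         -p "a lovely cat" \
         --cfg-scale 1.0 --sampling-method euler \
         -t 24
```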
@stduhpf I changed the thread count to 2/4/8/24/48/96, and it doesn't seem to make a difference. Maybe the default number of threads is set to the number of CPU cores? #3
You're right @hxgqh, it uses the number of physical cores of the CPU by default (so 12 in my case).
If I don't set the `-t` argument at all, it takes 4 seconds with 12 threads.
Then I guess either your CPU is too slow, or you're running out of system memory and it's using swap (if that's the case, maybe using a quantized version of t5xxl could help).
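If swapping is the suspicion, a quick check with standard Linux tools (nothing sd-specific) while the conditioning step runs is usually enough to confirm it:

```sh
# Watch memory while the t5xxl step runs: if "available" drops toward zero
# and swap usage climbs, the slowdown is paging rather than raw CPU speed.
watch -n 1 free -h
```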
@Green-Sky Is there any plan to run t5xxl on GPU?
Something seems off with it: for that same t5xxl step on a Ryzen 5600X, the Python CPU implementation (from the diffusers library) runs 67 seconds faster (237%) than the C++ implementation.
And that's against the best of the sd-cpp measurements, with 6 threads; setting it to 12 (nproc) makes it another 22 seconds slower:
- python diffusers (cpu) - 52 seconds
- sd-cpp 3 threads - 199 seconds
- sd-cpp 6 threads - 119 seconds
- sd-cpp 12 (nproc) threads - 141 seconds
- sd-cpp 16 threads - 147 seconds
Offtopic: an inference step takes ~9 seconds for both diffusers on GPU and sd-cpp on GPU; however, to achieve the same quality of result, diffusers takes 4 steps while sd-cpp needs 20 - but I guess the latter has something to do with the default parameters.
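If it is just the defaults, the step budget and sampler can be set explicitly; a hedged sketch of the relevant sd-cpp flags (values are illustrative, file names are placeholders):

```sh
# Make the comparison at an equal step budget instead of relying on defaults:
# --steps, --sampling-method and --cfg-scale are the knobs that usually
# explain quality differences between pipelines.
./bin/sd --diffusion-model flux1-dev-q8_0.gguf \
         --clip_l clip_l.safetensors --t5xxl t5xxl_fp16.safetensors \
         --vae ae.safetensors \
         -p "a lovely cat" \
         --steps 4 --sampling-method euler --cfg-scale 1.0
```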
I found the solution to my problem with the high condition graph computing time; maybe it will be useful for someone else:
I was building the package on an old Xeon server but running it on a new Ryzen workstation, so apparently some CPU optimizations were disabled during compilation.
After re-compiling it on the workstation itself, the condition graph now computes in 18 seconds in the same test case (34 seconds faster than diffusers-cpu, and 100 seconds faster than before, without those optimizations).
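For anyone hitting the same thing, a sketch of the rebuild (flag name as in ggml's CMake options; `GGML_NATIVE=ON` is typically the default when configuring on the machine you will run on):

```sh
# Configure and build on the machine that will actually run sd, so ggml can
# enable the host CPU's SIMD extensions (AVX2, F16C, ...) instead of the
# lowest common denominator of an older build box.
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON
cmake --build build --config Release -j
```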
Just some stats.

- Compiled with `-DGGML_NATIVE=OFF`, so it's using SSE4.2, and run with `-t 1` on an Intel i9-10900, it takes 15 minutes for the t5xxl_fp16 step. Without `-t` (which means 20 threads) it takes 101 seconds.
- Compiled with `-DGGML_NATIVE=ON` on an AMD EPYC 7543, so it's only AVX2 & F16C-optimized, and run on the i9-10900 without `-t` (i.e. 20 threads), it takes 11 seconds for the t5 step. The upstream `sd` binary from the master-10feacf release page gives almost the same speed.
- The previous build, when run on an even newer Xeon(R) Gold 5420+ (with 8 threads), takes 30 seconds.
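For comparison, the two build variants above come down to a single CMake switch (a sketch; which SIMD level a non-native build ends up with also depends on the explicit GGML_* options):

```sh
# Portable build (first measurement above): no host autodetection, so only
# explicitly enabled instruction sets are used - here it ended up at SSE4.2.
cmake -B build-portable -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF
cmake --build build-portable -j

# Native-optimized build (second measurement): enables what the build host
# supports (AVX2, F16C, ...), which is what makes the 11-second t5 step possible.
cmake -B build-native -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON
cmake --build build-native -j
```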
@stduhpf I'm super curious: what hardware do you have to run it in just 3.5 seconds on 24 threads, and how do you build sd? Is it built for avx512*?
@vt-alt I don't have anything special, just an R9 5900X (12 cores / 24 threads, with AVX2 and F16C) and 32 GB of dual-channel DDR4@3333MHz. I'm thinking memory bandwidth might be a bottleneck for computing the conditioning? (Also, I'm using a q4_k quantization for t5xxl, which can have a big influence.)
> Also I'm using a q4_k quantization for t5xxl

That must be it, I can confirm that quantization affects the time a lot.
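For reference, a hedged sketch of producing and using a quantized t5xxl (flag names as in the CLI help of recent builds; whether `-M convert` accepts a standalone text encoder may depend on the version - a pre-quantized t5xxl GGUF passed to `--t5xxl` works the same way):

```sh
# Quantize the fp16 t5xxl to q4_k (file names are placeholders):
./bin/sd -M convert -m t5xxl_fp16.safetensors -o t5xxl_q4_k.gguf --type q4_k -v

# Then point the Flux pipeline at the quantized encoder:
./bin/sd --diffusion-model flux1-dev-q8_0.gguf \
         --clip_l clip_l.safetensors \
         --t5xxl t5xxl_q4_k.gguf \
         --vae ae.safetensors \
         -p "a lovely cat" --cfg-scale 1.0 --sampling-method euler -t 24
```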